Big data project for IoT

About the Client

The Client is a well-known global pharmaceutical company with over 30 medicine manufacturing facilities spread worldwide. Their expertise lies in the development of antibodies and cutting-edge cancer treatments. Setting a remarkable benchmark in the industry, they boast the widest range of vaccines in their portfolio, distributing a staggering two million doses each day. Moreover, they actively engage in research and development endeavors that specifically target drug-resistant infections, showcasing their commitment to advances in medicine.

Complication

The Client’s primary objective is to achieve uninterrupted, real-time access to data from all factories. This initiative aims to facilitate constant monitoring and analysis of product quality while optimizing costs.

The Client faced two significant challenges in achieving this objective. First, the operational technology (OT) infrastructure connected to the Internet of Things (IoT) devices had limited computing power, which constrained real-time data processing. Second, the lack of centralized data storage hindered efficient data analysis. To address these challenges, the Client partnered with Datumo to develop an end-to-end system that enables real-time access to data from all factories and consolidates that data in a central Cloud storage solution.

The value we delivered

Constant Monitoring and Quality Analysis: The system detects flawed batches of medicine at an early stage, preventing their shipment to pharmacies. This reduces health risks and safeguards the reputation of the Client.
Cost optimization: Bad batches of medicine not only pose health risks, but also carry the potential for lawsuits and unnecessary expenses. Identifying and addressing quality issues during the production stage is significantly more cost-effective than resolving them at later stages in the supply chain. Compared with fixing an issue in the production stage, fixing the same issue costs 5 times as much in the logistics stage and 100 times as much in the distribution stage. The system's ability to detect and rectify flawed batches early is therefore instrumental in minimizing financial losses.
Undisturbed Access to Data: The uninterrupted access allows the Client to continuously improve their machine learning (ML) algorithms responsible for detecting flawed batches. By leveraging the power of ML and real-time data, the Client can enhance their quality control processes and optimize production efficiency.
Knowledge Sharing and System Maintenance: Datumo provides knowledge transfer and support to the Client, enabling them to effectively maintain and manage the implemented system. This knowledge sharing ensures that the Client can independently maintain system efficiency, keeping it up to date and aligned with evolving industry standards.
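
To make the cost multipliers quoted under Cost optimization concrete, here is a minimal arithmetic sketch. The 1x, 5x and 100x factors come from the text above; the base cost of 10,000 per flawed batch and the function names are hypothetical and chosen only for illustration.

```python
# Relative cost multipliers quoted in the text: production = 1x,
# logistics = 5x, distribution = 100x.
STAGE_MULTIPLIER = {"production": 1, "logistics": 5, "distribution": 100}


def rectification_cost(base_cost: float, stage: str) -> float:
    """Cost of fixing one flawed batch when it is caught at `stage`."""
    return base_cost * STAGE_MULTIPLIER[stage]


def savings_from_early_detection(base_cost: float, late_stage: str) -> float:
    """Amount saved by catching the batch in production instead of `late_stage`."""
    return rectification_cost(base_cost, late_stage) - rectification_cost(
        base_cost, "production"
    )


# Hypothetical base cost per flawed batch (any currency), for illustration only.
base = 10_000
costs = {stage: rectification_cost(base, stage) for stage in STAGE_MULTIPLIER}
for stage, cost in costs.items():
    print(f"{stage}: {cost}")
```

Under this assumed base cost, a batch caught in production rather than in distribution avoids 99 times the base cost in rectification expenses.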

Innovative solutions and advanced technologies

The data is produced by IoT devices controlled by Programmable Logic Controllers (PLC). The proposed solution is to build ML algorithms providing recommendations for engineers on how to reconfigure the PLC drivers in order to correct substandard batches. However, in order to create good ML models you need a high volume of quality data. Collecting the data, scattered among 30 factory systems, in one central storage is the biggest challenge.

While building ML models recommending a PLC configuration we solved both issues:

limited computing power in OT infrastructure with access to IoT devices,
a lack of centralized data.

Both can be addressed via a Cloud environment that provides flexible access to computing power and overcomes the problem of inefficient data joins between different facilities by directly connecting to factories and collecting data for central Cloud storage.

Another important factor is the total cost of ownership and the resilience of the solution. The solution is designed so that onboarding additional factories does not increase costs linearly. Because VM resources are quite limited at the factory level, factories only buffer the data resiliently and push it to the Cloud using Azure IoT Edge and Azure IoT Hub. All processing and further analytics are done in the Azure Cloud, which keeps the cost of onboarding each additional factory low.
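
The idea of resiliently buffering data at the factory and pushing it to the Cloud can be illustrated with a generic store-and-forward queue. This is a conceptual sketch in plain Python, not the Azure IoT Edge API and not the project's implementation; the class name, the `send` callback and the message shape are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class StoreAndForwardBuffer:
    """Keeps messages locally until the uplink accepts them, preserving order."""

    def __init__(self, send: Callable[[Dict], bool], max_size: int = 10_000):
        # `send` is a hypothetical uplink callback returning True on success.
        self._send = send
        # A bounded deque silently drops the OLDEST message once it is full,
        # which keeps memory use fixed on a resource-constrained factory VM.
        self._queue: Deque[Dict] = deque(maxlen=max_size)

    def publish(self, message: Dict) -> None:
        """Buffer the message, then try to drain everything that is pending."""
        self._queue.append(message)
        self.flush()

    def flush(self) -> int:
        """Send pending messages in order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break  # uplink is down; keep the message for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        return len(self._queue)


# Simulated uplink that is offline at first and comes back later.
delivered: List[Dict] = []
uplink = {"online": False}


def fake_send(message: Dict) -> bool:
    if uplink["online"]:
        delivered.append(message)
        return True
    return False


buffer = StoreAndForwardBuffer(fake_send)
buffer.publish({"id": 1})
buffer.publish({"id": 2})
pending_while_offline = buffer.pending()

uplink["online"] = True
buffer.publish({"id": 3})
```

Because the factory side only buffers and forwards, the per-factory footprint stays small and the heavy work remains in the Cloud, which is consistent with the cost argument above.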

To meet the real-time data processing requirements, Apache Kafka gathers the data from the IoT devices. It enables logging, monitoring and metrics collection in real time, which is required to monitor and support the production lines. Heavy real-time processing of the data (enrichment and cleaning) is done in Kafka Streams. To write the data to the storage layer for further non-real-time analytics, Kafka Connect is used with an ADLS (Azure Data Lake Storage) connector.
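
The cleaning and enrichment step can be pictured as a per-record filter and lookup. Kafka Streams itself is a Java/Scala library, so the sketch below uses plain Python purely to show the shape of the logic; the field names (`device_id`, `value`, `ts`), the metadata table and the validation rules are assumptions, not details of the actual pipeline.

```python
import math
from typing import Dict, Iterable, Iterator, Optional

# Hypothetical lookup table: device id -> factory metadata used for enrichment.
DEVICE_METADATA = {
    "plc-001": {"factory": "factory-a", "line": "filling-1"},
}


def clean(reading: Dict) -> Optional[Dict]:
    """Return a normalised reading, or None if the reading is malformed."""
    device_id = reading.get("device_id")
    if not isinstance(device_id, str) or not device_id.strip():
        return None
    try:
        value = float(reading.get("value"))
    except (TypeError, ValueError):
        return None
    if not math.isfinite(value):
        return None
    return {
        "device_id": device_id.strip().lower(),
        "value": value,
        "ts": reading.get("ts"),
    }


def enrich(reading: Dict) -> Optional[Dict]:
    """Attach factory metadata; drop readings from unknown devices."""
    meta = DEVICE_METADATA.get(reading["device_id"])
    if meta is None:
        return None
    return {**reading, **meta}


def process(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Clean, then enrich, each record; malformed or unknown records are skipped."""
    for raw in stream:
        cleaned = clean(raw)
        if cleaned is None:
            continue
        enriched = enrich(cleaned)
        if enriched is not None:
            yield enriched


raw_readings = [
    {"device_id": " PLC-001 ", "value": "21.5", "ts": 1},  # valid after cleaning
    {"device_id": "plc-001", "value": "n/a", "ts": 2},     # non-numeric value
    {"device_id": "unknown", "value": 3, "ts": 3},         # no metadata available
]
result = list(process(raw_readings))
```

In a streaming topology the same two functions would typically map onto a filter/map stage followed by a join against a metadata table.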

The data warehouse (DWH) is built on Databricks, which runs all ETL pipelines. Spark jobs process and enrich the data, and the data is stored as partitioned Delta tables. Databricks also serves as the data science workbench where the ML models are developed and experiments are run.

Last but not least, as per the Client's requirements, the ML pipelines run in both Azure ML and Vertex AI (GCP), depending on the use case and on which platform's algorithms are the better fit. Thanks to this, we created and served ML models that provide accurate recommendations for IoT engineers, which leads to correct PLC driver settings and fewer flawed batches.
