Big data project for
The Client is one of the biggest pharmaceutical companies in the world, owning around 30 medicine manufacturing facilities that require real-time monitoring and data processing. The data is produced by IoT devices controlled by PLC drivers. The medicine is batched, and quality tests are then performed on each batch. It is important to detect flawed batches early to avoid shipping them to pharmacies, which could result in health issues, lawsuits and unnecessary costs.
What & how?
The proposed solution was to build ML algorithms providing recommendations for engineers on how to reconfigure the PLC drivers in order to correct substandard batches. However, in order to create good ML models you need a high volume of quality data. Collecting the data, scattered among the systems of 30 factories, in one central storage was the biggest challenge.
On the way to building ML models recommending a PLC configuration we solved two issues:
- limited computing power in IoT devices
- a lack of centralised data.
Both can be addressed with a Cloud environment. It provides flexible access to computing power, and it removes the need for inefficient data joins between facilities by connecting directly to the factories and collecting their data in central Cloud storage.
Another important factor is the total cost of ownership and the resilience of the solution. The solution is designed in such a way that onboarding additional factories does not scale the costs linearly. As VM resources are quite limited at the factory level, we addressed this by pushing resiliently buffered data to the Cloud using Azure IoT Edge and Azure IoT Hub. All processing and further analytics are done in the Azure Cloud, which keeps the cost of onboarding additional factories low.
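The idea of resilient buffering can be illustrated with a minimal store-and-forward sketch. It is not the Azure IoT Edge API, which handles buffering itself. The class name `StoreAndForwardBuffer`, the `send` callback and the `max_size` limit are assumptions made for illustration only.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict

Reading = Dict[str, Any]


class StoreAndForwardBuffer:
    """Holds readings locally until the uplink accepts them (hypothetical helper)."""

    def __init__(self, send: Callable[[Reading], bool], max_size: int = 10_000) -> None:
        # `send` is assumed to return True once the cloud has acknowledged a reading.
        self._send = send
        # Bounded queue: when full, the oldest reading is discarded first.
        self._queue: Deque[Reading] = deque(maxlen=max_size)

    def publish(self, reading: Reading) -> None:
        """Queue the reading, then try to drain everything that is pending."""
        self._queue.append(reading)
        self.flush()

    def flush(self) -> int:
        """Send pending readings in arrival order; stop at the first failure."""
        sent = 0
        while self._queue:
            if not self._send(self._queue[0]):
                break  # uplink is down: keep the reading for the next attempt
            self._queue.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        """Number of readings still waiting to be delivered."""
        return len(self._queue)
```

Readings published while the connection is down stay in the queue and are delivered in order on the next successful attempt, so a short outage at a factory does not lose data.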
To meet the real-time data processing requirements, Apache Kafka gathers the data from the IoT devices. It enables logging, monitoring and metrics collection in real time, which is required to monitor and support the production lines. Heavy real-time processing of the data (enrichment and cleaning) is done in Kafka Streams. To store the data in the persistent storage layer (for further non-real-time analytics), we used Kafka Connect with an ADLS connector.
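The kind of per-record cleaning and enrichment described above can be sketched as follows. In the project this step runs in Kafka Streams; the Python below only illustrates the logic and uses no Kafka API. The field names (`device_id`, `value`, `factory_id`, `line`) and the `DEVICE_REGISTRY` lookup are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Optional

# Hypothetical reference data: which factory and production line a device belongs to.
DEVICE_REGISTRY: Dict[str, Dict[str, str]] = {
    "plc-001": {"factory_id": "factory-a", "line": "line-1"},
}


def clean_and_enrich(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return an enriched copy of the record, or None if it should be dropped."""
    device_id = record.get("device_id")
    value = record.get("value")
    # Cleaning: drop records from unknown devices.
    if not isinstance(device_id, str) or device_id not in DEVICE_REGISTRY:
        return None
    # Cleaning: drop records whose value is not numeric (bool is excluded on purpose).
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    # Enrichment: normalise the value and attach the factory metadata.
    return {**record, "value": float(value), **DEVICE_REGISTRY[device_id]}


def process(stream: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Apply the per-record step to a stream, skipping dropped records."""
    for record in stream:
        result = clean_and_enrich(record)
        if result is not None:
            yield result
```

Keeping the step a pure function of one record makes it easy to test in isolation, whichever streaming framework ends up calling it.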
The DWH (Data Warehouse) is based on Databricks, which is used for all ETL pipelines. Spark jobs are responsible for processing and enriching the data. The data is stored as partitioned Delta Tables. Databricks is also used as a data science workbench where the ML models are developed and experiments are run.
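To show what partitioning means for the stored data, here is a small sketch that groups records into Hive-style partition directories of the kind Spark typically writes. The partition columns (`factory_id` and `event_date`) are an assumption, since the text does not say which columns are used, and the functions are hypothetical helpers rather than Spark or Delta Lake APIs. In Spark itself the equivalent is typically expressed with `partitionBy` on the DataFrame writer.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, List


def partition_path(record: Dict[str, Any]) -> str:
    """Build a Hive-style partition directory from the factory and the event date (UTC)."""
    ts = datetime.fromtimestamp(float(record["timestamp"]), tz=timezone.utc)
    return f"factory_id={record['factory_id']}/event_date={ts:%Y-%m-%d}"


def group_by_partition(records: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group records by the partition directory they would be written to."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[partition_path(record)].append(record)
    return dict(groups)
```

With a layout like this, a query for one factory on one day only has to read the files in a single directory instead of scanning the whole table.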
Last but not least, as per Client requirements, the ML pipelines run in both Azure ML and Vertex AI (GCP), depending on the use case and on whether the algorithms provided by Azure or GCP are the better fit. Thanks to this, we were able to create and serve ML models that provided accurate recommendations for the IoT engineers, which led to correct PLC driver settings and fewer flawed batches.
- end-to-end system allows real-time access to data from all factories in one central Cloud storage
- constant monitoring and analysis of quality
- undisturbed access to data allows for the retraining and improvement of the ML algorithms responsible for recommending PLC configuration
- cost optimization: thanks to constant monitoring and the retraining of ML algorithms, the number of flawed batches was reduced significantly.