Big data project for
E-commerce
About the Client
The Client is one of the largest e-commerce companies in Europe. They offer international customers a selection of nearly 300 million products across a wide range of categories. Their operations span 6 markets and attract over 20 million monthly visits. They collaborate with a network of over 130,000 professional merchants.
Complication
The Client's objective was to ensure optimal data platform performance for its expanding customer base throughout the year. This involved achieving seamless scalability for data operations during periods of high traffic, such as Christmas and Black Friday, as well as supporting more complex computations, including the training of multiple advanced machine learning models. The new data platform architecture was also expected to foster cost accountability across the organization, providing insights into the ROI of data-related investments. Additionally, the architecture needed to offer flexible resource organization and cost allocation options for various departments and teams.
To address this challenge, the Client chose to collaborate with us to expedite the migration to Google Cloud Platform. The Datumo team of experts plays an important role in assisting the Client and streamlining their transition to the Cloud.
The Value We Delivered
- Accelerated Cloud Adoption: Our work facilitated the swift migration of the Big Data platform to the Cloud. This enabled the Client to harness the full range of Cloud benefits and relieved product teams of platform work, allowing them to focus on business development.
- FinOps Strategies: We cooperated with the Client in creating and deploying FinOps strategies, directing our efforts towards key areas that frequently give rise to unmonitored expenses:
  - reducing superfluous data processing and report generation (such as redundant tasks or obsolete data),
  - reducing idle resources,
  - resolving issues linked to BigQuery utilization and Spark optimization (minimizing the amount of data queried to keep costs under control).
- Performance Optimization: Datumo's efforts resulted in enhanced platform performance, ensuring smoother and faster data processing and analysis.
- Enhanced Data Quality: Precise validation techniques were implemented, contributing to improved data quality and reliability.
- Knowledge Transfer and Collaboration: Multiple knowledge transfer sessions were conducted with the Client, fostering a collaborative environment and empowering the team with advanced expertise. Datumo conducted regular training sessions on Google Cloud migration and operation. Moreover, we highlighted the most prominent migration cases and optimization techniques, enabling the Client teams to seamlessly manage and enhance the platform.
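The cost-accountability goal behind the FinOps work above can be illustrated with a minimal sketch that attributes spend to teams by resource label. This is not the Client's actual tooling: the billing rows, the `team` label key, and the `cost_by_team` helper are all hypothetical, and a real setup would read from a billing export rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical billing rows: (resource name, labels, cost in EUR).
# In practice these would come from a cloud billing export.
BILLING_ROWS = [
    ("dataproc-job-a", {"team": "search"}, 120.0),
    ("bq-query-b", {"team": "pricing"}, 80.5),
    ("bq-query-c", {"team": "search"}, 30.0),
    ("vm-idle-d", {}, 45.0),  # unlabeled resource, nobody owns the cost
]


def cost_by_team(rows, label_key="team", fallback="unallocated"):
    """Sum cost per value of label_key; unlabeled spend goes to fallback."""
    totals = defaultdict(float)
    for _name, labels, cost in rows:
        totals[labels.get(label_key, fallback)] += cost
    return dict(totals)


totals = cost_by_team(BILLING_ROWS)
for team, cost in sorted(totals.items()):
    print(f"{team}: {cost:.2f} EUR")
```

Surfacing an explicit `unallocated` bucket is a deliberate choice in this sketch: spend that no team owns is the kind that tends to go unmonitored.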
Innovative Solutions and Advanced Technologies
The Client’s on-premises Big Data and AI platform was based on a common Hadoop stack. HDFS and Hive served as the data lake. Data was transformed by Apache Spark applications running on one large Apache YARN cluster. Processing workflows were defined and orchestrated with Apache Airflow.
In the "improve and move" migration, platform components were replaced by their managed or serverless GCP equivalents. BigQuery became the new serverless data lake. The single long-lived enterprise Spark cluster was replaced by ephemeral Dataproc clusters, each created only for the duration of a processing job. Cloud Composer was used as a managed Apache Airflow service.
This may seem straightforward. However, the petabyte scale of the data, more than 1,000 workflows, and the complexity of adapting the Hadoop stack to the Cloud made the migration a demanding challenge. To take full advantage of Cloud-native capabilities, workloads had to be refactored or rearchitected. SQL-based processing was replaced by serverless BigQuery jobs, and the computationally heavy parts of Spark processing were offloaded to BigQuery.
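The ephemeral-cluster pattern described above (create a cluster, run the job, always tear the cluster down) can be sketched in plain Python. This is an illustration under stated assumptions, not the Client's implementation: `FakeClusterApi`, `ephemeral_cluster`, and `run_workflow` are hypothetical names, and the in-memory class stands in for a real cluster service so that the example runs without any cloud credentials.

```python
from contextlib import contextmanager


class FakeClusterApi:
    """Hypothetical in-memory stand-in for a cluster service (not a GCP client)."""

    def __init__(self):
        self.running = set()
        self.events = []

    def create(self, name):
        self.running.add(name)
        self.events.append(("create", name))

    def delete(self, name):
        self.running.discard(name)
        self.events.append(("delete", name))


@contextmanager
def ephemeral_cluster(api, name):
    """Create a cluster for one job and delete it afterwards, even on failure."""
    api.create(name)
    try:
        yield name
    finally:
        api.delete(name)


def run_workflow(api, workflow_id, job):
    """Run `job` on a cluster that exists only while the job is running."""
    with ephemeral_cluster(api, f"cluster-{workflow_id}") as cluster:
        api.events.append(("job", cluster))
        return job(cluster)


api = FakeClusterApi()
result = run_workflow(api, "daily-sales", lambda cluster: f"ran on {cluster}")
print(result, api.running)
```

Because teardown sits in a `finally` block, a failing job does not leave a cluster running, which is what keeps idle resources from accumulating cost.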