Creating Big Data platforms using technologies from the Apache family
The aim of the training is to gain practical knowledge about Big Data solutions.
Technologies
Apache Spark, Apache Kafka, Apache Airflow, Apache Druid
Purpose of training
You will learn how to use popular Big Data technologies: Apache Spark, Apache Kafka, Apache Airflow and Apache Druid. You will also discover how to build complex Big Data systems from scratch. Practical workshops are the main strength of the training.
50% theory, 50% practical workshops
The training can be conducted at the Client's office or another convenient location.
Duration:
3 days. The training program is tailored to the needs of the group.
Who is the training addressed to
The training is addressed to programmers and business analysts whose goal is to learn about Big Data tools. Basic knowledge of Java or Scala is recommended.
Training plan
Module 1: Overview of the Apache family of Big Data solutions and an introduction to data processing
1.1 Overview of the Apache family of Big Data solutions
1.2 Scala for Big Data
- 1.2.1 Case Class, Traits
- 1.2.2 Tuples
- 1.2.3 Lazy evaluation
- 1.2.4 String interpolation
- 1.2.5 Pattern matching
- 1.2.6 Companion object
- 1.2.7 Collections and transformations
- 1.2.8 For comprehension, mapping
- 1.2.9 Try/Either/Option
- 1.2.10 Implicits
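The topics above can be illustrated with a minimal Scala sketch; the Person type, its parse helper, and the sample rows are invented purely for illustration and are not part of the course material.

```scala
// Case class with a companion object holding a "smart constructor".
case class Person(name: String, age: Int)

object Person {
  // Option models the possible absence of a valid result.
  def parse(raw: String): Option[Person] = raw.split(",") match {
    case Array(name, age) if age.trim.nonEmpty && age.trim.forall(_.isDigit) =>
      Some(Person(name.trim, age.trim.toInt))
    case _ => None
  }
}

object ScalaTour extends App {
  val rows = List("Ada,36", "broken row", "Alan,41")

  // For comprehension mixing a List and Options: keeps only the rows that parse.
  val people = for {
    row    <- rows
    person <- Person.parse(row)
  } yield person

  // Collection transformation plus string interpolation.
  people.map(p => s"${p.name} is ${p.age}").foreach(println)

  // Lazy evaluation: the block runs only on first access to the value.
  lazy val expensive = { println("computing..."); 42 }
  println(expensive)
}
```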
1.3 Apache Spark - introduction
- 1.3.1 RDD, DataFrame, Dataset
- 1.3.2 Lazy evaluation
- 1.3.3 Transformations and actions
- 1.3.4 Spark vs. Hadoop
- 1.3.5 DataFrame vs. Dataset API
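As a taste of the concepts in 1.3, the sketch below runs Spark in local mode on a handful of in-memory rows; the Reading case class and the sample values are assumptions made for the example. The chained transformations only build an execution plan, and nothing is computed until the show() action.

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an encoder for the type.
final case class Reading(sensor: String, value: Double)

object SparkIntro {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for experiments; on a cluster the master comes from the deployment.
    val spark = SparkSession.builder()
      .appName("spark-intro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset is the typed API; a DataFrame is simply Dataset[Row].
    val readings = Seq(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 2.0)).toDS()

    // Transformations (filter, groupBy, avg) are lazy: they only extend the plan.
    val avgBySensor = readings
      .filter($"value" > 0.0)
      .groupBy($"sensor")
      .avg("value")

    // An action (show) triggers the actual computation.
    avgBySensor.show()

    spark.stop()
  }
}
```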
Module 2: Data processing using Apache Spark and a modern data warehouse - Apache Druid
2.1 Workshops: Spark - how to enrich your data?
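One possible shape of this exercise, sketched here with invented table and column names rather than the actual workshop data, is enriching an event DataFrame with a small reference table via a join.

```scala
import org.apache.spark.sql.SparkSession

object EnrichEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrich-events").master("local[*]").getOrCreate()
    import spark.implicits._

    // Fact data: raw events keyed by a country code.
    val events = Seq(("u1", "PL", 10), ("u2", "DE", 7), ("u3", "PL", 3))
      .toDF("userId", "countryCode", "clicks")

    // Small reference table used to enrich the events.
    val countries = Seq(("PL", "Poland"), ("DE", "Germany"))
      .toDF("countryCode", "countryName")

    // Left join keeps every event even when the lookup has no match.
    val enriched = events.join(countries, Seq("countryCode"), "left")

    enriched.show()
    spark.stop()
  }
}
```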
2.2 Apache Spark - architecture and optimization
- 2.2.1 Architecture (driver, worker, executor...)
- 2.2.2 Optimization of jobs and parameters
- 2.2.3 Deployment
- 2.2.4 Shuffling
- 2.2.5 Common errors - key-skew, serialization, OOM
- 2.2.6 Broadcast, repartition, caching, execution plans, optimization
- 2.2.7 Spark internals - joins, group by
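Here is a sketch of the techniques from 2.2.6, using synthetic data produced with spark.range; the table sizes, column names, and partition count are arbitrary assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SparkTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-tuning").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = spark.range(0, 1000000)
      .selectExpr("id", "id % 100 as customerId", "rand() as amount")
    val customers = spark.range(0, 100)
      .selectExpr("id as customerId", "concat('c', cast(id as string)) as name")

    // Broadcast join: ships the small table to every executor instead of shuffling both sides.
    val joined = orders.join(broadcast(customers), "customerId")

    // Repartition by the grouping key to control parallelism before a wide operation.
    val repartitioned = joined.repartition(8, $"customerId")

    // Cache a result that more than one action will reuse.
    val perCustomer = repartitioned.groupBy("customerId").sum("amount").cache()

    // Inspect the physical plan: BroadcastHashJoin vs. SortMergeJoin, exchange boundaries, etc.
    perCustomer.explain()
    perCustomer.show(5)

    spark.stop()
  }
}
```

The explain() output shows the chosen join strategy and the shuffle (exchange) boundaries; key skew from 2.2.5 is usually diagnosed in the Spark UI by comparing task durations within a stage.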
2.3 Apache Druid
- 2.3.1 Architecture
- 2.3.2 Data structures
- 2.3.3 Component management
- 2.3.4 Druid and Big Data platforms based on Apache Hadoop
- 2.3.5 Real-time and batch processing
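Druid is commonly queried through its SQL endpoint (/druid/v2/sql) exposed by the router or broker. The sketch below assumes a local Druid instance and an "events" datasource; both are placeholders for the example.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DruidSqlQuery {
  def main(args: Array[String]): Unit = {
    // Druid SQL over HTTP: the body is a JSON object with the query text.
    val body =
      """{"query": "SELECT channel, COUNT(*) AS cnt FROM events GROUP BY channel ORDER BY cnt DESC LIMIT 10"}"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8888/druid/v2/sql")) // assumed local router address
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    // The response body is a JSON array with one object per result row.
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```

Because Druid merges real-time and historical segments of the same datasource at query time, the query stays the same whether the data arrived through streaming or batch ingestion (2.3.5).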
Module 3: Streaming and orchestration
3.1 Apache Kafka
- 3.1.1 Pub/Sub pattern, difference between push and pull models
- 3.1.2 Architecture
- 3.1.3 Topics
- 3.1.4 Kafka producer & Kafka consumer
- 3.1.5 Analysis of the scalability of an Apache Kafka-based system
- 3.1.6 Consumer groups
- 3.1.7 Replication and retention
- 3.1.8 ZooKeeper
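A minimal producer/consumer round trip against an assumed local broker and an assumed "events" topic, touching the pub/sub, consumer-group, and pull-based consumption ideas from 3.1.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaRoundTrip {
  def main(args: Array[String]): Unit = {
    val broker = "localhost:9092" // assumed local broker
    val topic  = "events"         // assumed topic name

    // Producer: publishes records; the key decides the target partition.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", broker)
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord(topic, "user-1", """{"action":"click"}"""))
    producer.close()

    // Consumer: pulls records; instances sharing a group.id split the partitions.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", broker)
    consumerProps.put("group.id", "training-consumers")
    consumerProps.put("auto.offset.reset", "earliest")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(List(topic).asJava)
    consumer.poll(Duration.ofSeconds(5)).asScala.foreach { record =>
      println(s"${record.key} -> ${record.value}")
    }
    consumer.close()
  }
}
```

Starting a second consumer with the same group.id would make Kafka rebalance the topic's partitions across the two instances, which is the scaling mechanism referred to in 3.1.5 and 3.1.6.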
3.2 Apache Airflow
- 3.2.1 Processing automation
- 3.2.2 Creating a data pipeline - defining processing Directed Acyclic Graphs (DAGs)
- 3.2.3 Architecture