Creating Big Data platforms using technologies from the Apache family
The aim of the training is to gain practical knowledge about Big Data solutions.
Technologies
Apache Spark, Apache Kafka, Apache Airflow, Apache Druid
Purpose of training
You will learn how to use popular Big Data technologies: Apache Spark, Apache Kafka, Apache Airflow and Apache Druid. You will also discover how to build complex Big Data systems from scratch. Practical workshops are the main strength of the training.
50% theory, 50% practical workshops
The training can be conducted at the Client's office or another convenient location.
Duration:
3 days. The training program is tailored to the needs of the group.
Who is the training addressed to
The training is addressed to programmers and business analysts whose goal is to learn about Big Data tools. Basic knowledge of Java or Scala is recommended.
Training plan
Module 1: Overview of the Apache family of Big Data solutions and an introduction to data processing
1.1 Overview of the Apache family of Big Data solutions
1.2 Scala for Big Data
- 1.2.1 Case Class, Traits
- 1.2.2 Tuples
- 1.2.3 Lazy evaluation
- 1.2.4 String interpolation
- 1.2.5 Pattern matching
- 1.2.6 Companion object
- 1.2.7 Collections and transformations
- 1.2.8 For comprehension, mapping
- 1.2.9 Try/Either/Option
- 1.2.10 Implicits
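The topics above can be illustrated with a minimal Scala sketch; the Person type, its parse helper, and the sample rows are invented purely for illustration and are not part of the course material.

```scala
// Case class with a companion object holding a "smart constructor".
case class Person(name: String, age: Int)

object Person {
  // Option models the possible absence of a valid result.
  def parse(raw: String): Option[Person] = raw.split(",") match {
    case Array(name, age) if age.trim.nonEmpty && age.trim.forall(_.isDigit) =>
      Some(Person(name.trim, age.trim.toInt))
    case _ => None
  }
}

object ScalaTour extends App {
  val rows = List("Ada,36", "broken row", "Alan,41")

  // For comprehension mixing a List and Options: keeps only the rows that parse.
  val people = for {
    row    <- rows
    person <- Person.parse(row)
  } yield person

  // Collection transformation plus string interpolation.
  people.map(p => s"${p.name} is ${p.age}").foreach(println)

  // Lazy evaluation: the block runs only on first access to the value.
  lazy val expensive = { println("computing..."); 42 }
  println(expensive)
}
```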
1.3 Apache Spark - introduction
- 1.3.1 RDD, DataFrame, Dataset
- 1.3.2 Lazy evaluation
- 1.3.3 Transformations and actions
- 1.3.4 Spark vs. Hadoop
- 1.3.5 DataFrame vs. Dataset API
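As a taste of the concepts in 1.3, the sketch below runs Spark in local mode on a handful of in-memory rows; the Reading case class and the sample values are assumptions made for the example. The chained transformations only build an execution plan, and nothing is computed until the show() action.

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an encoder for the type.
final case class Reading(sensor: String, value: Double)

object SparkIntro {
  def main(args: Array[String]): Unit = {
    // Local SparkSession for experiments; on a cluster the master comes from the deployment.
    val spark = SparkSession.builder()
      .appName("spark-intro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Dataset is the typed API; a DataFrame is simply Dataset[Row].
    val readings = Seq(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 2.0)).toDS()

    // Transformations (filter, groupBy, avg) are lazy: they only extend the plan.
    val avgBySensor = readings
      .filter($"value" > 0.0)
      .groupBy($"sensor")
      .avg("value")

    // An action (show) triggers the actual computation.
    avgBySensor.show()

    spark.stop()
  }
}
```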
Module 2: Data processing using Apache Spark and a modern data warehouse - Apache Druid
2.1 Workshops: Spark - how to enrich your data?
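One possible shape of this exercise, sketched here with invented table and column names rather than the actual workshop data, is enriching an event DataFrame with a small reference table via a join.

```scala
import org.apache.spark.sql.SparkSession

object EnrichEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("enrich-events").master("local[*]").getOrCreate()
    import spark.implicits._

    // Fact data: raw events keyed by a country code.
    val events = Seq(("u1", "PL", 10), ("u2", "DE", 7), ("u3", "PL", 3))
      .toDF("userId", "countryCode", "clicks")

    // Small reference table used to enrich the events.
    val countries = Seq(("PL", "Poland"), ("DE", "Germany"))
      .toDF("countryCode", "countryName")

    // Left join keeps every event even when the lookup has no match.
    val enriched = events.join(countries, Seq("countryCode"), "left")

    enriched.show()
    spark.stop()
  }
}
```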
2.2 Apache Spark - architecture and optimization
- 2.2.1 Architecture (driver, worker, executor...)
- 2.2.2 Optimization of jobs and parameters
- 2.2.3 Deployment
- 2.2.4 Shuffling
- 2.2.5 Common errors - key-skew, serialization, OOM
- 2.2.6 Broadcast, repartition, caching, execution plans, optimization
- 2.2.7 Spark internals - joins, group by
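Here is a sketch of the techniques from 2.2.6, using synthetic data produced with spark.range; the table sizes, column names, and partition count are arbitrary assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SparkTuning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("spark-tuning").master("local[*]").getOrCreate()
    import spark.implicits._

    val orders = spark.range(0, 1000000)
      .selectExpr("id", "id % 100 as customerId", "rand() as amount")
    val customers = spark.range(0, 100)
      .selectExpr("id as customerId", "concat('c', cast(id as string)) as name")

    // Broadcast join: ships the small table to every executor instead of shuffling both sides.
    val joined = orders.join(broadcast(customers), "customerId")

    // Repartition by the grouping key to control parallelism before a wide operation.
    val repartitioned = joined.repartition(8, $"customerId")

    // Cache a result that more than one action will reuse.
    val perCustomer = repartitioned.groupBy("customerId").sum("amount").cache()

    // Inspect the physical plan: BroadcastHashJoin vs. SortMergeJoin, exchange boundaries, etc.
    perCustomer.explain()
    perCustomer.show(5)

    spark.stop()
  }
}
```

The explain() output shows the chosen join strategy and the shuffle (exchange) boundaries; key skew from 2.2.5 is usually diagnosed in the Spark UI by comparing task durations within a stage.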
2.3 Apache Druid
- 2.3.1 Architecture
- 2.3.2 Data structures
- 2.3.3 Component management
- 2.3.4 Druid and Big Data platforms based on Apache Hadoop
- 2.3.5 Real-time and batch processing
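Druid is commonly queried through its SQL endpoint (/druid/v2/sql) exposed by the router or broker. The sketch below assumes a local Druid instance and an "events" datasource; both are placeholders for the example.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object DruidSqlQuery {
  def main(args: Array[String]): Unit = {
    // Druid SQL over HTTP: the body is a JSON object with the query text.
    val body =
      """{"query": "SELECT channel, COUNT(*) AS cnt FROM events GROUP BY channel ORDER BY cnt DESC LIMIT 10"}"""

    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://localhost:8888/druid/v2/sql")) // assumed local router address
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()

    // The response body is a JSON array with one object per result row.
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body())
  }
}
```

Because Druid merges real-time and historical segments of the same datasource at query time, the query stays the same whether the data arrived through streaming or batch ingestion (2.3.5).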
Module 3: Streaming and orchestration
3.1 Apache Kafka
- 3.1.1 Pub/Sub pattern, difference between push and pull models
- 3.1.2 Architecture
- 3.1.3 Topics
- 3.1.4 Kafka producer & Kafka consumer
- 3.1.5 Analysis of the scalability of an Apache Kafka-based system
- 3.1.6 Consumer groups
- 3.1.7 Replication and retention
- 3.1.8 ZooKeeper
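A minimal producer/consumer round trip against an assumed local broker and an assumed "events" topic, touching the pub/sub, consumer-group, and pull-based consumption ideas from 3.1.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaRoundTrip {
  def main(args: Array[String]): Unit = {
    val broker = "localhost:9092" // assumed local broker
    val topic  = "events"         // assumed topic name

    // Producer: publishes records; the key decides the target partition.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", broker)
    producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](producerProps)
    producer.send(new ProducerRecord(topic, "user-1", """{"action":"click"}"""))
    producer.close()

    // Consumer: pulls records; instances sharing a group.id split the partitions.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", broker)
    consumerProps.put("group.id", "training-consumers")
    consumerProps.put("auto.offset.reset", "earliest")
    consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(List(topic).asJava)
    consumer.poll(Duration.ofSeconds(5)).asScala.foreach { record =>
      println(s"${record.key} -> ${record.value}")
    }
    consumer.close()
  }
}
```

Starting a second consumer with the same group.id would make Kafka rebalance the topic's partitions across the two instances, which is the scaling mechanism referred to in 3.1.5 and 3.1.6.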
3.2 Apache Airflow
- 3.2.1 Processing automation
- 3.2.2 Creating a data pipeline - defining processing Directed Acyclic Graphs (DAGs)
- 3.2.3 Architecture