Big Data processing using Apache Spark
During the training, you will learn how to use the Apache Spark framework to quickly process large amounts of data.
Purpose of training
This course provides an introduction to the Apache Spark architecture. It can be conducted in Scala or Python. It covers the full process of creating a Spark application: integration with a data source, data processing, process optimization and saving the results to a database in a cloud environment.
You will get familiar with the Spark API and learn how to write Spark jobs for both common and domain-specific problems. We will discuss optimizations, the most frequent challenges and ways to overcome them. The training focuses mainly on practical skills.
40% - theory, 60% - practical workshops
The training can be conducted in the Client’s office or online.
The training is aimed at developers and business analysts who want to learn Apache Spark. Basic knowledge of Python or Scala is recommended.
Module 1 Introduction to data processing using Apache Spark
1.1 Origins - what is new in the latest Apache Spark releases, integration with Cloud/Hadoop
1.2 Introduction to the API
- 1.2.1 RDD / Dataset / Dataframe
- 1.2.2 Main characteristics, differences and performance comparison
- 1.2.3 Recommendations, tips & tricks
1.3 Lazy Evaluation - Transformations and Actions
- 1.3.1 How the transformation execution graph is built
- 1.3.2 How to read and interpret the execution graph
- 1.3.3 How to reuse an RDD together with its previously performed transformations
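Lazy evaluation can be illustrated without Spark at all. The toy class below (plain Python, not Spark's actual API) records transformations in a plan and only executes them when an action is called — the same contract Spark's transformations and actions follow:

```python
# A toy model of lazy evaluation (not Spark itself): transformations only
# record a step in the plan; the action collect() runs the whole chain.
class LazyDataset:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded transformations, not yet run

    def map(self, fn):                   # transformation: returns a new plan
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):              # transformation: returns a new plan
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):                   # action: executes the recorded plan
        rows = self._data
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has been computed yet; only the two-step plan exists.
result = ds.collect()  # now the plan actually runs
```

In real Spark the recorded plan is the DAG you inspect in the Spark UI, and caching an intermediate RDD lets several actions reuse the already-computed steps instead of replaying them.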
1.4 Shuffling - transferring data between machines
- 1.4.1 Wide vs. narrow transformations - which ones require shuffling
- 1.4.2 ReduceByKey and GroupByKey
- 1.4.3 Minimizing the impact on application performance
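The reduceByKey vs. groupByKey difference can be sketched in plain Python (a simulation of the semantics, not Spark itself): reduceByKey combines values locally in each partition before any record crosses the network, so fewer records are shuffled.

```python
from collections import defaultdict

# Two simulated partitions of (key, value) records.
partitions = [[("a", 1), ("a", 1), ("b", 1)], [("a", 1), ("b", 1), ("b", 1)]]

def group_by_key(parts):
    # groupByKey: every (key, value) record is shuffled as-is.
    shuffled = [rec for part in parts for rec in part]
    out = defaultdict(list)
    for k, v in shuffled:
        out[k].append(v)
    return dict(out), len(shuffled)          # (result, records shuffled)

def reduce_by_key(parts, fn):
    # reduceByKey: map-side combine first, so at most one record per key
    # per partition is shuffled.
    shuffled = []
    for part in parts:
        local = {}
        for k, v in part:
            local[k] = fn(local[k], v) if k in local else v
        shuffled.extend(local.items())
    out = {}
    for k, v in shuffled:
        out[k] = fn(out[k], v) if k in out else v
    return out, len(shuffled)                # (result, records shuffled)
```

With the sample data above, groupByKey moves all 6 records, while reduceByKey moves only 4 — and the gap grows with the number of duplicate keys per partition.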
1.5 Data partitioning
- 1.5.1 The choice between repartition() and coalesce()
- 1.5.2 Main assumptions of data partitioning
- 1.5.3 Number/size of partitions vs. processing performance
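The repartition() vs. coalesce() choice can also be modelled in a few lines of plain Python (a conceptual sketch, not Spark's implementation): repartition performs a full shuffle — every record is rehashed to a target partition — while coalesce only merges existing partitions, so no record is rehashed.

```python
def repartition(parts, n):
    # Full shuffle: every record is rehashed to one of n new partitions.
    new = [[] for _ in range(n)]
    for part in parts:
        for rec in part:
            new[hash(rec) % n].append(rec)
    return new

def coalesce(parts, n):
    # No shuffle: whole existing partitions are merged into n buckets.
    new = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        new[i % n].extend(part)
    return new

parts = [[1], [2], [3], [4]]
two_parts = coalesce(parts, 2)   # [[1, 3], [2, 4]] — records never rehashed
```

This is why coalesce is the cheap choice for reducing the partition count, while repartition (or repartition by a column) is needed to increase it or to rebalance skewed partitions.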
1.6 Basic configuration of an Apache Spark-based project:
- 1.6.1 How to configure the job launch script
- 1.6.2 How to write job code so that it is easy to test
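A job launch script typically wraps spark-submit. The fragment below is a hypothetical example only — the master, resource sizes, paths and the my_job.py name are placeholders to adapt to your own cluster:

```shell
# Hypothetical launch script; all names and sizes are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py --input /data/in --output /data/out
```

Keeping configuration in the launch script (or a config file) rather than hard-coded in the job is also what makes the job code easy to run and test locally.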
1.7 Possibilities of integrating Spark with other solutions (databases, HDFS, Avro, text files, CSV, JSON ...)
Module 2 Architecture, integration, common problems and optimization of applications based on Apache Spark
2.1 Architecture of a Spark application
- 2.1.1 Spark Driver, Worker and Executor
- 2.1.2 Job vs. Stage vs. Task
- 2.1.3 Processing units and data units
- 2.1.4 Deployment possibilities
2.2 Testing Spark jobs
2.3 Joins
- 2.3.1 Physical types of joins
- 2.3.2 Best practices
- 2.3.3 Using Join to minimize data transfer between machines (shuffling reduction)
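The main shuffle-reducing join — the broadcast (map-side) hash join — can be sketched in plain Python (a simulation of the idea, not Spark's implementation): when one side is small, it is shipped whole to every partition and probed locally, so the large side is never shuffled.

```python
# Broadcast side: a small lookup table that fits in each executor's memory.
small = {"PL": "Poland", "DE": "Germany"}

# Large side, already split into partitions; "XX" has no match.
large_partitions = [
    [("PL", 100), ("DE", 200)],
    [("PL", 300), ("XX", 400)],
]

def broadcast_join(parts, lookup):
    # Each partition joins locally against the broadcast lookup table,
    # so no record of the large side crosses the (simulated) network.
    joined = []
    for part in parts:
        for key, value in part:
            if key in lookup:                # inner join: drop misses
                joined.append((key, value, lookup[key]))
    return joined

rows = broadcast_join(large_partitions, small)
```

In Spark this corresponds to a broadcast hash join (e.g. via a broadcast hint), whereas joining two large tables falls back to a shuffle-based join where both sides are repartitioned by key.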
2.4 UDFs - how to build them and how they impact performance. Differences between the DataFrame and Dataset APIs.
2.5 Spark job optimizations and common issues
- 2.5.1 Key skew
- 2.5.2 OOM (out-of-memory) errors
- 2.5.3 Broadcast
- 2.5.4 Cache
- 2.5.5 Serialization
- 2.5.6 How to choose the size of executors
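One common remedy for key skew covered in this module is "salting": a hot key is split into several sub-keys so its records spread across many partitions, and the salt is stripped in a second aggregation step. A plain-Python sketch of the idea (names like salt_key are illustrative, not a Spark API):

```python
import random

SALTS = 4  # how many sub-keys each hot key is split into

def salt_key(key, hot_keys):
    # Spread records of a hot key over SALTS sub-keys; leave others alone.
    if key in hot_keys:
        return f"{key}#{random.randrange(SALTS)}"
    return key

def unsalt_key(key):
    return key.split("#")[0]

records = [("hot", 1)] * 8 + [("cold", 1)] * 2
salted = [(salt_key(k, {"hot"}), v) for k, v in records]

# Step 1: aggregate per salted key (work now spreads across partitions).
partial = {}
for k, v in salted:
    partial[k] = partial.get(k, 0) + v

# Step 2: strip the salt and aggregate the partial results.
final = {}
for k, v in partial.items():
    base = unsalt_key(k)
    final[base] = final.get(base, 0) + v
```

The totals are unchanged, but no single reducer has to process all eight "hot" records — the same trick applies to skewed join keys.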
2.6 Interpretation and optimization of query plans
- 2.6.1 Navigating through Spark UI
- 2.6.2 Verification and most important elements
2.7 Spark Catalyst and Tungsten
For many years, I have been regularly conducting practical training in the area of Big Data. I conduct lectures and workshops with pleasure and a smile on my face, and I have received very positive feedback from my students. In the past, I also lectured in postgraduate programmes. Let's talk!
I am a dedicated trainer specialising in Scala and Spark. I enjoy sharing knowledge and conducting training while popularizing programming best practices. I develop code on a daily basis, so I constantly build up my hands-on experience. Would you like to talk about programming practices? Write to me!
I have been running Big Data training courses since 2019. I relish the opportunity to meet interesting people and identify problems that Big Data technologies can solve. Let's talk!
I am a Big Data trainer. I effectively share knowledge in technological areas such as Apache Hadoop, Spark, NiFi, Airflow, Druid and Kafka. Training gives me an opportunity to spread interest in new technologies and, as a result, help overcome business challenges. Let's talk!
I conduct training in Big Data technologies, focusing on Scala and Spark. My goal is to transfer knowledge in an effective and interesting way that attracts people to new technologies and solutions. Let's talk!
Since 2020, I have been co-creating materials and conducting theoretical and practical classes focused on Spark. I use this technology on a daily basis to process large data sets from various industries, both in on-premise environments and those based on cloud solutions. This makes me a reliable expert who builds training on experience.