Apache Spark: ecosystem overview with Apache Hadoop YARN and HDFS

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters. It is widely used in industry for distributed data processing at scale.

The project is open source under the Apache Software Foundation; it was created by the founders of Databricks, a company that provides managed Spark clusters and support.

Let's take a deep dive into the architecture of Apache Spark on YARN, in a distributed ecosystem of containers and Java VMs.

This architecture needs to address both computation and storage at high scale, to process large volumes of data efficiently.

Table of contents

  • Spark ecosystem
  • Spark architecture through Hadoop
  • Spark engine for computation and storage
  • Go with Spark in production
  • Example of a Spark application at scale
  • Go further
  • What’s next
  • Thanks to
  • Glossary
  • References

Let's start with the basics. You can jump straight to the sections that interest you, or read top to bottom to see how this whole ecosystem works.

Spark ecosystem

Data pipelines

Spark allows building data pipelines in both batch mode and streaming mode.

Lambda architecture with Spark [source]

Here you see a lambda architecture, which associates online and offline processing; you can instead structure your data pipeline as a kappa or zeta architecture. These will be covered in another article.

Spark vs Spark Streaming [built by author]

Spark's streaming mode is really a micro-batch approach. Frameworks such as Kafka Streams, Faust Streaming, or Apache Flink are more commonly used for true record-at-a-time stream processing; overall, Spark's strength lies in batch processing.
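To make the contrast concrete, here is a minimal sketch of the same aggregation written as a batch job and as a Structured Streaming job, which Spark executes as a series of micro-batches. The paths, the event_type column, and the JSON layout are assumptions made for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-streaming").getOrCreate()

# Batch mode: the whole dataset is read and processed at once.
batch_df = spark.read.json("/data/events/")  # hypothetical input path
batch_df.groupBy("event_type").count() \
    .write.mode("overwrite").parquet("/data/event_counts/")

# Streaming mode: Structured Streaming picks up new files in micro-batches.
stream_df = spark.readStream.schema(batch_df.schema).json("/data/events/")
query = (stream_df.groupBy("event_type").count()
         .writeStream
         .outputMode("complete")  # emit the full aggregated counts each micro-batch
         .format("console")
         .start())
query.awaitTermination()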

APIs

Spark offers several APIs, one for each of the following programming languages: Scala/Java, Python, and R, respectively through Spark, PySpark, and SparkR.

Different types of Spark engines [built by author]

In practice, Spark and PySpark are used far more than SparkR. PySpark is more convenient for experimentation, while the Scala/Java API delivers better performance.
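As an example, a minimal PySpark program only needs a SparkSession; the sample data below is invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-minimal").getOrCreate()

# A tiny in-memory DataFrame, enough to try the API.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()

spark.stop()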

Connectors

Spark has many connectors for reading from and writing to different types of storage.

Apache Spark’s ecosystem of connectors [source]

Here, we will focus a bit more on the connection with Apache Hadoop HDFS.
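As a hedged sketch of that connection, PySpark reads and writes hdfs:// URIs through the Hadoop client; the namenode address and paths below are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-io").getOrCreate()

# Read a CSV file stored on HDFS (hypothetical namenode host and port).
df = spark.read.option("header", "true").csv("hdfs://namenode:8020/raw/sales.csv")

# Write it back as Parquet, a columnar format that suits Spark well.
df.write.mode("overwrite").parquet("hdfs://namenode:8020/curated/sales/")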

Spark SQL

In addition, you can run Spark with SQL. A single SQL query can access heterogeneous data sources and combine them in memory:

Spark SQL overview [source]
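For instance, here is a sketch of combining two heterogeneous sources in memory with one SQL query; the file paths and column names are illustrative assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-join").getOrCreate()

# Two different storage formats, both exposed to SQL as temporary views.
spark.read.parquet("/data/users/").createOrReplaceTempView("users")
spark.read.json("/data/orders/").createOrReplaceTempView("orders")

# One SQL query joins them in memory, regardless of their source format.
spark.sql("""
    SELECT u.name, COUNT(*) AS nb_orders
    FROM users u
    JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
""").show()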

The Catalyst & Tungsten engines allows optimizing the execution plan of the query:

Catalyst [source]
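You can see what Catalyst produces for a given query with explain(); the DataFrame below is only there to give the optimizer something to plan:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst-explain").getOrCreate()

df = spark.range(1_000_000).filter("id % 2 = 0").selectExpr("id * 2 AS doubled")

# extended=True prints the parsed, analyzed and optimized logical plans,
# plus the physical plan generated for Tungsten execution.
df.explain(extended=True)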

PySpark

In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
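To make the bridge visible, you can call into the driver's JVM through the SparkContext's _jvm gateway; note that _jvm is an internal, unsupported attribute, used here purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("py4j-bridge").getOrCreate()
sc = spark.sparkContext

# Both calls travel over the Py4J gateway to the driver's JVM.
print(sc._jvm.java.lang.System.currentTimeMillis())
print(sc._jvm.java.lang.Runtime.getRuntime().availableProcessors())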
