Apache Spark: ecosystem overview with Apache Hadoop YARN and HDFS

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters. It is widely used in industry for distributed computation on data at scale.

The project is open source under the Apache Software Foundation. It was created by the founders of Databricks, a company that provides clusters and support for Spark.

Let's take a deep dive into the architecture of Apache Spark on YARN in a distributed ecosystem of containers and Java VMs.

This architecture needs to address both computation and storage at scale in order to process large volumes of data efficiently.

Table of contents

  • Spark ecosystem
  • Spark architecture through Hadoop
  • Spark engine for computation and storage
  • Go with Spark in production
  • Example of Spark application at Scale
  • Go further
  • What’s next
  • Thanks to
  • Glossary
  • References

Let’s start with the basics. You can jump straight to the sections that interest you, or read from top to bottom to see how the whole ecosystem works.

Spark ecosystem

Data pipelines

Spark allows building data pipelines in both batch mode and streaming mode.

Lambda architecture with Spark [source]

The figure shows a lambda architecture, but to combine online and offline processing you can also use a kappa or zeta architecture for your data pipeline. These are covered in another article.

Spark vs Spark Streaming [built by author]

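Spark's streaming mode is closer to a micro-batch approach: incoming records are grouped into small batches, and each batch is processed like a regular batch job. Other frameworks such as Kafka Streams, Faust Streaming, or Apache Flink are more commonly used to process data streams. Overall, Spark is strongest at batch processing.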
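To make the micro-batch idea concrete, here is a minimal sketch in plain Python (no Spark involved). The helper names `micro_batches` and `process_batch` are hypothetical, and the batches are cut by record count for simplicity, whereas Spark typically triggers micro-batches on a time interval.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(stream: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream of events into small batches of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def process_batch(batch: List[int]) -> int:
    """Stand-in for a batch job (hypothetical): here it simply sums the events."""
    return sum(batch)


# A finite range stands in for an event stream in this illustration.
events = range(1, 8)
results = [process_batch(batch) for batch in micro_batches(events, batch_size=3)]
print(results)
```

Each batch is handled by the same function a batch pipeline would use, which is why batch and streaming code can look so similar in a micro-batch engine. The trade-off is latency: an event waits until its batch is closed before it is processed.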


Spark exposes several APIs, one for each of the following programming languages: Scala/Java, Python, and R, through Spark, PySpark, and SparkR respectively.

Different types of Spark engines [built by author]

In practice, Spark and PySpark are used more than SparkR. PySpark is more convenient for experimentation, while Spark in Scala/Java generally gives better performance.


Spark has many connectors to read from and write to different types of storage.

Apache Spark’s ecosystem of connectors [source]

Here, we will focus a bit more on the connection with Apache Hadoop HDFS.

Spark SQL

In addition, you can query data in Spark with SQL. A single SQL query can access heterogeneous data sources and combine them in memory:

Spark SQL overview [source]

The Catalyst optimizer and the Tungsten execution engine optimize the query's execution plan:

Catalyst [source]


In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.

