Apache Spark: ecosystem overview with Apache Hadoop YARN and HDFS
Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning workloads on single-node machines or clusters. It is widely used in industry for distributed computation of data at scale.
The project is open source under the Apache Software Foundation and was created by the founders of Databricks, a company that provides managed clusters and support for Spark.
Let's dive into the architecture of Apache Spark on YARN in a distributed ecosystem of containers and Java VMs.
This architecture needs to address both computation and storage at scale, in order to process large volumes of data efficiently.
Table of contents
- Spark ecosystem
- Spark architecture through Hadoop
- Spark engine for computation and storage
- Go with Spark in production
- Example of Spark application at Scale
- Go further
- What’s next
- Thanks to
- Glossary
- References
Let's start with the basics. You can jump straight to the sections that interest you, or read from top to bottom to see how the whole ecosystem fits together.
Spark ecosystem
Data pipelines
Spark allows building data pipelines in both batch and streaming mode.
Here, you see a lambda architecture, but to combine online and offline processing you can also use a kappa or zeta architecture for your data pipeline. That will be covered in another article.
Spark's streaming mode follows a micro-batch approach. Other frameworks such as Kafka Streams, Faust Streaming, or Apache Flink are better suited to processing true data streams; Spark's strength remains batch processing.
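As a rough illustration, here is a minimal PySpark sketch, with hypothetical paths and columns, of the same aggregation expressed once as a batch job and once as a micro-batch Structured Streaming job:

```python
# Minimal sketch (PySpark): batch vs micro-batch streaming.
# Paths, columns, and trigger interval are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()

# Batch mode: read a bounded dataset, aggregate, write once.
batch_df = spark.read.parquet("hdfs:///data/events/2024-01-01")
(batch_df.groupBy("country").agg(F.count("*").alias("events"))
    .write.mode("overwrite").parquet("hdfs:///data/agg/daily"))

# Streaming mode: same logic, but the source is consumed in micro-batches.
stream_df = (spark.readStream
             .schema(batch_df.schema)
             .parquet("hdfs:///data/events/incoming"))

query = (stream_df.groupBy("country").agg(F.count("*").alias("events"))
         .writeStream
         .outputMode("complete")
         .format("memory").queryName("events_by_country")
         .trigger(processingTime="30 seconds")   # micro-batch interval
         .start())
```

The streaming query re-runs the same logic at each trigger interval, which is exactly the micro-batch behaviour described above.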
APIs
Spark provides several APIs, one for each of the following languages: Scala/Java, Python, and R, through Spark, PySpark, and SparkR respectively.
In practice, Spark and PySpark are used more than SparkR. PySpark is more convenient for experimentation, while the Scala/Java API gives better performance.
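For example, a minimal PySpark session for quick experimentation might look like the following (the data is made up); as long as you stick to built-in DataFrame functions, the actual work is executed by the JVM engine, so the Python overhead stays small:

```python
# Minimal sketch (PySpark): interactive experimentation with the DataFrame API.
# Column names and rows are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-demo").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Built-in expressions are compiled into JVM execution plans,
# not executed row by row in Python.
df.filter(F.col("age") > 30).select(F.upper("name").alias("name")).show()
```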
Connectors
Spark has many connectors to read from and write to different types of storage.
Here, we will focus on the connection with Apache Hadoop HDFS.
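As a minimal sketch (the namenode address, paths, and the partitioning column are hypothetical), reading from and writing to HDFS with PySpark looks like this:

```python
# Minimal sketch (PySpark): HDFS as a source and a sink.
# In a YARN cluster the default filesystem usually comes from core-site.xml,
# so plain /paths work too; the explicit hdfs://namenode:8020 URI is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Read a CSV file stored on HDFS.
df = spark.read.option("header", "true").csv("hdfs://namenode:8020/raw/sales.csv")

# Write it back as Parquet, partitioned by a (hypothetical) "year" column.
df.write.mode("overwrite").partitionBy("year").parquet("hdfs://namenode:8020/warehouse/sales")
```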
Spark SQL
In addition, you can run Spark with SQL. With a single SQL query, you can access heterogeneous data sources and combine them in memory.
The Catalyst and Tungsten engines optimize the execution plan of the query.
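As a minimal sketch (the Parquet and JSON paths and the schema are hypothetical), here is how two heterogeneous sources can be registered as views and joined with one SQL query; explain() prints the plan optimized by Catalyst and executed by the Tungsten engine:

```python
# Minimal sketch (PySpark): one SQL query over two heterogeneous sources.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Register a Parquet dataset and a JSON dataset as temporary views.
spark.read.parquet("hdfs:///data/customers").createOrReplaceTempView("customers")
spark.read.json("hdfs:///data/orders").createOrReplaceTempView("orders")

result = spark.sql("""
    SELECT c.country, COUNT(*) AS orders
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.country
""")

result.explain()   # inspect the optimized physical plan
result.show()
```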
PySpark
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext. Py4J is only used on the driver for local communication between the Python and Java SparkContext objects; large data transfers are performed through a different mechanism.
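To make the bridge concrete, here is a minimal sketch; _jsc and _jvm are private PySpark attributes shown only to illustrate the Py4J gateway, not a public API:

```python
# Minimal sketch (PySpark): the Python SparkContext wraps a JavaSparkContext
# reached through Py4J. The underscore attributes are implementation details.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("py4j-demo")
sc = SparkContext(conf=conf)

print(sc._jsc)   # the JavaSparkContext object living in the driver JVM
print(sc._jvm.java.lang.System.currentTimeMillis())  # arbitrary JVM call via Py4J

sc.stop()
```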