Hadoop at Criteo

by Anthony Rabier, Staff Site Reliability Lead Engineer at Criteo
and William Montaz, Senior Staff Site Reliability Engineer at Criteo.

Criteo migrated from Hadoop 2 to Hadoop 3 with a lot of patches, in two steps: first the runtime, Hadoop YARN and HDFS, was moved to Hadoop 3 progressively, without downtime and without any intervention from the development teams; then the application frameworks were migrated for Spark, Flink, and Hadoop MapReduce. The result is a vanilla Hadoop 3 distribution that runs projects compiled against both Hadoop 2 and Hadoop 3, thanks to backward-compatibility fixes and tricks developed by Criteo and merged into the core Hadoop project.

Migration from Hadoop 2 to Hadoop 3
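
To make the two-step idea more concrete, here is a minimal Scala sketch, assuming only the standard Hadoop client APIs (the class name and checks are illustrative, not Criteo's tooling): an application still compiled against Hadoop 2 client jars can report the client version it was built with while talking to HDFS services that have already been upgraded to Hadoop 3, since the wire protocols remain compatible.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.VersionInfo

object HadoopClientVersionCheck {
  def main(args: Array[String]): Unit = {
    // Version of the Hadoop client jars bundled with the application
    // (still 2.x even while the cluster runtime already runs Hadoop 3).
    println(s"Client-side Hadoop version: ${VersionInfo.getVersion}")

    // Wire protocols remain compatible, so the same Hadoop 2 client code
    // can keep talking to an HDFS service upgraded to Hadoop 3.
    val fs = FileSystem.get(new Configuration())
    println(s"Default filesystem: ${fs.getUri}")
  }
}
```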

Garmadon is a Java agent deployed with all JVM processes running on the Criteo Hadoop cluster to handle metrology for Spark, Flink, and Hadoop MapReduce jobs … With it you can build generic Grafana dashboards, create data lineage between your datasets, and audit all operations on HDFS.

Garmadon, a Java agent for metrology on Hadoop
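
As a hedged illustration of what "deployed with all JVM processes" can look like, the Scala sketch below attaches such an agent to the JVMs of a Spark application through the standard -javaagent mechanism; the jar path is a placeholder and this is not necessarily how Garmadon is actually rolled out on the cluster.

```scala
import org.apache.spark.SparkConf

object AgentEnabledSparkConf {
  // Hypothetical location of the agent jar on the cluster nodes.
  val agentJar = "/opt/garmadon/garmadon-agent.jar"

  def build(): SparkConf =
    new SparkConf()
      .setAppName("agent-instrumented-job")
      // Driver (in cluster mode) and executor JVMs load the agent at
      // startup, so every container of the application emits events
      // without any change to the job's code.
      .set("spark.driver.extraJavaOptions", s"-javaagent:$agentJar")
      .set("spark.executor.extraJavaOptions", s"-javaagent:$agentJar")
}
```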

Data processing at Criteo

by Miguel Liroz, Senior Staff Site Reliability Lead Engineer at Criteo
and Raphael Claude, Staff Site Reliability Engineer at Criteo.

Datadoc is an internal tool used at Criteo to browse the data catalog and to document each dataset.

Datadoc [built by author]

BigDataFlow is an internal Criteo tool, written in Scala, that schedules data processing jobs from an extended SQL query.

BigDataFlow extended SQL [built by author]
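
To give a flavour of how scheduling can be derived from SQL, here is a toy Scala sketch; the DataJob model, the regex-based parsing, and the assumed syntax are invented for illustration and are not BigDataFlow's actual API. The idea is that the tables a query reads and writes are enough to infer the dependency graph between jobs.

```scala
object SqlJobDependencies {
  // Invented job model: a name plus the (extended) SQL it runs.
  final case class DataJob(name: String, sql: String) {
    // Crude extraction of the tables read in FROM/JOIN clauses.
    def inputs: Set[String] =
      raw"(?i)(?:FROM|JOIN)\s+([\w.]+)".r
        .findAllMatchIn(sql).map(_.group(1)).toSet

    // Crude extraction of the table written by an INSERT clause.
    def output: Option[String] =
      raw"(?i)INSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)".r
        .findFirstMatchIn(sql).map(_.group(1))
  }

  // A job depends on every job that produces one of the tables it reads.
  def upstream(jobs: Seq[DataJob]): Map[String, Set[String]] = {
    val producers = jobs.flatMap(j => j.output.map(_ -> j.name)).toMap
    jobs.map(j => j.name -> j.inputs.flatMap(producers.get)).toMap
  }
}
```

With such a dependency map in hand, jobs can be topologically ordered, and each job launched once all of its upstream datasets are available.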

👀 Read more

Read the rest of the article …

