Hadoop at Criteo

by Anthony Rabier, Staff Site Reliability Lead Engineer at Criteo
and William Montaz, Senior Staff Site Reliability Engineer at Criteo.

Criteo migrated from Hadoop 2 to Hadoop 3 with a lot of patches, in two steps: first the runtime, Hadoop YARN and HDFS, was moved to Hadoop 3 progressively, without downtime and without any intervention from the development teams; then the application frameworks were migrated for Spark, Flink, and Hadoop MapReduce. The result is a vanilla Hadoop 3 distribution that runs projects compiled against both Hadoop 2 and Hadoop 3, thanks to backward-compatibility fixes and tricks developed by Criteo and merged into the core Hadoop project.

Migration from Hadoop 2 to Hadoop 3
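
To make the two-step idea more concrete, here is a minimal Scala sketch, assuming only the standard Hadoop client APIs (the class name and checks are illustrative, not Criteo's tooling): an application still compiled against Hadoop 2 client jars can report the client version it was built with while talking to HDFS services that have already been upgraded to Hadoop 3, since the wire protocols remain compatible.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.VersionInfo

object HadoopClientVersionCheck {
  def main(args: Array[String]): Unit = {
    // Version of the Hadoop client jars bundled with the application
    // (still 2.x even while the cluster runtime already runs Hadoop 3).
    println(s"Client-side Hadoop version: ${VersionInfo.getVersion}")

    // Wire protocols remain compatible, so the same Hadoop 2 client code
    // can keep talking to an HDFS service upgraded to Hadoop 3.
    val fs = FileSystem.get(new Configuration())
    println(s"Default filesystem: ${fs.getUri}")
  }
}
```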

Garmadon is a Java agent deployed with all JVM processes running on the Criteo Hadoop cluster to handle metrology for Spark, Flink, and Hadoop MapReduce jobs … With it you can build generic Grafana dashboards, create data lineage between your datasets, and audit all operations on HDFS.

Garmadon, a Java agent for metrology on Hadoop
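
As a hedged illustration of what "deployed with all JVM processes" can look like, the Scala sketch below attaches such an agent to the JVMs of a Spark application through the standard -javaagent mechanism; the jar path is a placeholder and this is not necessarily how Garmadon is actually rolled out on the cluster.

```scala
import org.apache.spark.SparkConf

object AgentEnabledSparkConf {
  // Hypothetical location of the agent jar on the cluster nodes.
  val agentJar = "/opt/garmadon/garmadon-agent.jar"

  def build(): SparkConf =
    new SparkConf()
      .setAppName("agent-instrumented-job")
      // Driver (in cluster mode) and executor JVMs load the agent at
      // startup, so every container of the application emits events
      // without any change to the job's code.
      .set("spark.driver.extraJavaOptions", s"-javaagent:$agentJar")
      .set("spark.executor.extraJavaOptions", s"-javaagent:$agentJar")
}
```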

Data processing at Criteo

by Miguel Liroz, Senior Staff Site Reliability Lead Engineer at Criteo
and Raphael Claude, Staff Site Reliability Engineer at Criteo.

Datadoc is an internal tool used at Criteo to browse the data catalog and to document each dataset.

Datadoc [built by author]

BigDataFlow is an internal Criteo tool, written in Scala, that schedules data processing jobs from an extended SQL query.

BigDataFlow extended SQL [built by author]
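
To give a flavour of how scheduling can be derived from SQL, here is a toy Scala sketch; the DataJob model, the regex-based parsing, and the assumed syntax are invented for illustration and are not BigDataFlow's actual API. The idea is that the tables a query reads and writes are enough to infer the dependency graph between jobs.

```scala
object SqlJobDependencies {
  // Invented job model: a name plus the (extended) SQL it runs.
  final case class DataJob(name: String, sql: String) {
    // Crude extraction of the tables read in FROM/JOIN clauses.
    def inputs: Set[String] =
      raw"(?i)(?:FROM|JOIN)\s+([\w.]+)".r
        .findAllMatchIn(sql).map(_.group(1)).toSet

    // Crude extraction of the table written by an INSERT clause.
    def output: Option[String] =
      raw"(?i)INSERT\s+(?:INTO|OVERWRITE\s+TABLE)\s+([\w.]+)".r
        .findFirstMatchIn(sql).map(_.group(1))
  }

  // A job depends on every job that produces one of the tables it reads.
  def upstream(jobs: Seq[DataJob]): Map[String, Set[String]] = {
    val producers = jobs.flatMap(j => j.output.map(_ -> j.name)).toMap
    jobs.map(j => j.name -> j.inputs.flatMap(producers.get)).toMap
  }
}
```

With such a dependency map in hand, jobs can be topologically ordered, and each job launched once all of its upstream datasets are available.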

👀 Read more

Read the rest of the article …

