Strata + Hadoop 2016 Conference in NY
This past September, Nike approved my request to attend the Strata + Hadoop Conference in New York. Strata is one of the biggest conferences for engineers interested in Machine Learning and Big Data. The three-day conference covered workshops and sessions on topics ranging from TensorFlow and tuning Spark applications to Big Data streaming, and much more.
My personal favorites were a hands-on TensorFlow tutorial from Google, a talk on Spark optimizations from Cloudera and the Parquet performance tuning session from Netflix.
I have been working with Spark a lot recently. While I really enjoy the simplicity of its API, I often find myself tweaking several memory configuration properties to optimize the performance of Spark jobs. The properties that most often need tuning are the number of cores per executor and the memory allocated to each executor. Add YARN to the mix and the settings become even more complicated because of the added complexity of container resource allocation. Spark Dynamic Allocation automates the number of executors created based on the task load, but the cores and memory per executor still have to be set by hand. Cloudera's talk on this topic was useful. It highlighted four main points to consider when tuning these properties (a configuration sketch follows the list):
- An executor can run multiple tasks concurrently, so increasing the number of cores per executor increases parallelism.
- Hadoop daemons need memory and cores of their own, so leave some room for overhead.
- The YARN ApplicationMaster also needs a core, so not all cores can be assigned to executors.
- Too many cores per executor can lead to poor HDFS I/O throughput.
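To make those points concrete, here is a minimal sketch of how they translate into a Spark configuration. The cluster shape (nodes with 16 cores and 64 GB of RAM) and all the resulting numbers are hypothetical, chosen only to illustrate the trade-offs above; the right values depend entirely on your own cluster.

```python
from pyspark.sql import SparkSession

# Hypothetical cluster: each node has 16 cores and 64 GB of RAM.
# Leave 1 core per node for Hadoop/OS daemons, and remember the YARN
# ApplicationMaster claims a core somewhere in the cluster too.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # ~5 cores per executor keeps HDFS I/O throughput healthy,
    # giving (16 - 1) / 5 = 3 executors per node.
    .config("spark.executor.cores", "5")
    # Roughly 64 GB / 3 executors, minus ~10% that YARN reserves as
    # off-heap overhead for each container.
    .config("spark.executor.memory", "19g")
    .config("spark.yarn.executor.memoryOverhead", "2048")  # in MB
    # Dynamic allocation scales the executor *count* with the task
    # backlog, but cores and memory per executor stay manual.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```

Note that dynamic allocation needs the external shuffle service enabled, since executors may be torn down while their shuffle output is still needed.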
One of the biggest challenges with Big Data is the data format. A good format can make a great deal of difference in query and ETL performance. Spark provides hooks to read several different formats such as JSON, CSV, XML, Avro, and my favorite, Parquet. Parquet is a columnar storage format that enables faster queries on big datasets. Netflix presented their big data architecture and how they leverage the Parquet format in their pipeline, along with a few good practical tips on optimizing the reading and storage of Parquet files, which can be found here.
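As a small illustration of why the columnar layout matters (this is a generic sketch with made-up paths and columns, not Netflix's pipeline), here is Parquet in Spark. Because data is stored column by column, selecting a few columns and filtering on them lets Spark skip most of the file.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-sketch").getOrCreate()

# Write a DataFrame out as Parquet, partitioned by a coarse column so
# queries can prune entire directories. Paths and columns are made up.
events = spark.read.json("/data/raw/events.json")
events.write.partitionBy("event_date").parquet("/data/warehouse/events")

# Reading back: Spark scans only the 'user_id' and 'event_type' columns
# (column pruning) and pushes the filter down to the Parquet reader, so
# non-matching row groups are never read off disk.
df = (
    spark.read.parquet("/data/warehouse/events")
    .where("event_date = '2016-09-27'")
    .select("user_id", "event_type")
)
df.show()
```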
Apart from all the technical talks, I also appreciated the emphasis on diversity at the conference. There were meetups and special sessions to engage a more diverse group of attendees. It was a great experience for me, and I hope I get more opportunities to attend similar events in the future.
Here are some of the resources from the event: