Richa Khandelwal

AWS

Tuning Spark Jobs on EMR with YARN - Lessons Learnt

Apache Spark is a distributed processing system that can process data at a very large scale. Even though Spark's memory model is optimized to handle large amounts of data, it is not magic, and there are several settings that can help you get the most out of your cluster. I

Cross-Account S3 bucket settings for data transfer on Hadoop based systems

While trying to write some data from one AWS account to another, I ran into several cross-account S3 settings issues. My Google searches turned up very little, so I am documenting the fix here in case somebody else runs into the same thing. Problem: Account 1 (let's call it Dumbledore) has a

Migrating to EMR 5.0.X for Spark 2.0

AWS released EMR 5.0 recently. It is a major release and contains upgrades such as Apache Spark 2.0, Apache Hive 2.1, Presto 0.150, Apache Zeppelin 0.6.1, etc. Spark 2.0 comes with various performance and API updates. There are also some breaking changes that