Apache Spark Tuning Spark Jobs on EMR with YARN - Lessons Learnt Apache Spark is a distributed processing system that can process data at very large scale. Even though Spark's memory model is optimized to handle large amounts of data, it is not magic, and there are several settings that can give you the most out
AWS Cross-Account S3 bucket settings for data transfer on Hadoop based systems While trying to write some data from one AWS account to another, I ran into several issues with cross-account S3 settings. My Google searches turned up little, so I am documenting the fix here in case somebody else runs into this. Problem Account 1 (let's call it
AWS Migrating to EMR 5.0.X for Spark 2.0 AWS recently released EMR 5.0. It is a major release and contains upgrades such as Apache Spark 2.0, Apache Hive 2.1, Presto 0.150, and Apache Zeppelin 0.6.1. Spark 2.0 comes with various performance and API updates. There