Migrating to EMR 5.0.X for Spark 2.0

AWS recently released EMR 5.0. It is a major release that upgrades several applications, including Apache Spark 2.0, Apache Hive 2.1, Presto 0.150, and Apache Zeppelin 0.6.1.
Spark 2.0 comes with various performance and API updates. It also has some breaking changes to consider when migrating to EMR 5.0.
Read the Spark release notes to understand what is new in Spark 2.0, and see the migration guide. The following are some of the factors to consider when upgrading to EMR 5.0:

  • HiveContext and SQLContext: SparkSession is the new unified entry point and replaces both. HiveContext is deprecated and SQLContext is kept for backward compatibility. If your project used these classes just to create the contexts, switch to SparkSession. Note that SparkSession lives in the spark-sql module (org.apache.spark.sql.SparkSession), so that dependency stays, and enableHiveSupport() still needs the Hive classes from spark-hive on the classpath. Here is an example:
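import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf().setAppName("Awesome Application")

// One session replaces the separate SQLContext and HiveContext.
val ss: SparkSession = SparkSession.builder
  .config(conf)
  .enableHiveSupport() // requires the Hive classes on the classpath
  .getOrCreate()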
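Code that previously went through a HiveContext or SQLContext can usually call the same operations on the session. The following is a minimal sketch under that assumption, not a definitive migration recipe. The object name, the helper tableCount, and any table name you pass to it are hypothetical and exist only for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.{DataFrame, SQLContext, SparkSession}

object SessionMigrationSketch {
  // Spark 1.x style was sqlContext.sql(query); in 2.0 the session runs SQL directly.
  def runQuery(spark: SparkSession, query: String): DataFrame = spark.sql(query)

  // Hypothetical helper: counts the rows of a table or view known to the session.
  def tableCount(spark: SparkSession, table: String): Long = spark.table(table).count()

  // The session still exposes the older entry points for code not yet migrated.
  def legacyHandles(spark: SparkSession): (SparkContext, SQLContext) =
    (spark.sparkContext, spark.sqlContext)
}
```

With a session such as ss from the example above, a call would look like SessionMigrationSketch.runQuery(ss, "SELECT 1").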
  • Java: Spark 2.0 deprecates support for Java 7, and EMR 5.0 ships with Java 8. If you have a custom bootstrap script that sets the Spark and Hadoop environment to Java 8 on the cluster, you can remove it. Spark jobs need to be migrated to Java 8 as well.

  • Scala: Scala 2.11.8 is now supported.

  • Zeppelin: Zeppelin is now a production release rather than a sandbox application. Drop "sandbox" from the application name in the create-cluster script (for example, Zeppelin-Sandbox becomes Zeppelin).

  • Hive: Spark 2.0 uses Hive 1.2.1 for its internal SQL actions, and that version uses Hadoop MapReduce as the execution engine by default. EMR 5.0.0 ships with Hive 2.1, which uses Apache Tez as the execution engine. Tez is supposed to be faster than MapReduce, but despite several tries I could not get Spark jobs to work with the Hive and Tez combination: I kept getting a Tez session class-not-found exception even though the jars were on the classpath. There may be a way to make them work together, but a quick workaround is to update the hive-site settings for the cluster to use "mr" as the execution engine. You could also change only the Spark Hive setting, which affects just your Spark jobs and not the whole EMR cluster. Here is a sample JSON that you can link to in your EMR create-cluster script.

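Set "Classification" to "hive-site" to change the whole cluster, or to "spark-hive-site" to change only Spark jobs. The sample uses "hive-site", and the connection values are placeholders.

[
    {
        "Classification": "hive-site",
        "Properties": {
            "hive.execution.engine": "mr",
            "javax.jdo.option.ConnectionURL": "url",
            "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "username",
            "javax.jdo.option.ConnectionPassword": "password"
        }
    }
]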
  • Apache Tez UI: Assuming that you have dynamic port forwarding and proxy settings configured, you can browse to the Apache Tez UI (Tez is the default engine for Hive tasks) here: :8080/tez-ui/#
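
Pulling together the dependency, Java, and Scala points from the list above, an sbt build for a job targeting this release might look roughly like the sketch below. It is a sketch under stated assumptions, not a definitive build file: the Spark version 2.0.0 and the "provided" scope (which assumes the cluster supplies the Spark jars at runtime) should be checked against your own cluster and packaging.

```scala
// build.sbt (sketch): versions mirror the ones mentioned in this post.
scalaVersion := "2.11.8"

// Compile Java and Scala sources for a Java 8 runtime.
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
scalacOptions += "-target:jvm-1.8"

// spark-sql provides org.apache.spark.sql.SparkSession;
// spark-hive provides the Hive classes that enableHiveSupport() needs.
// Assumption: Spark 2.0.0, with the jars supplied by the cluster ("provided").
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"
)
```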