Cross-Account S3 bucket settings for data transfer on Hadoop-based systems

While trying to write some data from one AWS account to another, I ran into several cross-account S3 settings issues. My searches were coming up thin, so I'm documenting the fix in case somebody else runs into this.

Problem

Account 1 (let's call it Dumbledore) has an S3 bucket. Account 2 (let's call it Voldemort) wants to write to it, but writing to Dumbledore's bucket makes Voldemort the owner of the objects in that bucket, which results in Dumbledore losing access to objects in his own bucket. Dumbledore needs access to those objects for the greater good. For AWS's explanation, see: http://docs.aws.amazon.com/AmazonS3/latest/dev/example-walkthroughs-managing-access-example3.html.
You may be wondering why Dumbledore and Voldemort can't just work together and set up the cross-account permissions as documented in the link above. Because:
The documentation suggests updating each object's permission with the s3api put-object-acl command. In Big Data land that would be too time-consuming and expensive, since we will be writing many files and won't necessarily know the names of the files that get generated.

There is also a command called put-bucket-acl. I tried that as well, but it failed too. My guess is that updating the bucket-level permission still does not update the objects that get created afterwards.
If you have a one-off case, you can go the above route and call it a day. For a large number of files on Hadoop-based systems, continue reading.
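
For the one-off case, the per-object route looks roughly like this with the AWS SDK for Java (a sketch only: it runs as Voldemort, reuses the example bucket name, and omits pagination, which is exactly why it does not scale to many files):

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.CannedAccessControlList

// Build a client with default (Voldemort's) credentials and grant the bucket
// owner (Dumbledore) full control on each existing object, one call per object.
val s3 = AmazonS3ClientBuilder.defaultClient()
s3.listObjectsV2("hogwarts").getObjectSummaries.asScala.foreach { summary =>
  s3.setObjectAcl("hogwarts", summary.getKey, CannedAccessControlList.BucketOwnerFullControl)
}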

It turned out that while it is difficult to get this cross-account setup working without some hacks at the S3 level, Hadoop provides a simpler way to resolve it. Since we are using Spark, which uses Hadoop to write to S3, we can use the fs.s3.acl.default, fs.s3n.acl.default, or fs.s3a.acl.default Hadoop setting, depending on which S3 filesystem you are using. Set its value to BucketOwnerFullControl, which gives Dumbledore full power to read/write/delete the objects that Voldemort creates.

Full Steps

I am documenting all the steps for setting up the buckets, roles/policies, and permissions for cross-account data transfer through the S3 API. This has been verified for Spark jobs on EMR, but should work the same for any Hadoop-based system.

  • Create an S3 bucket in Dumbledore's account. Example: hogwarts
  • There are two ways to provide the access:
    • Via IAM Role and Policy:

      • Create a role in Dumbledore's account. Example: S3CrossAccountTransfer
      • Create a policy in Dumbledore's account (Example: DumbledoreWins).
      • Create a policy document to customize the permissions on the S3 bucket, and attach the DumbledoreWins policy to the S3CrossAccountTransfer role. The IAM section in the AWS console UI will guide you through this. (Skip this if you are using the bucket policy approach below.) Policy example:
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Sid": "Stmt1461003119438",
          "Action": "s3:*",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::hogwarts/*"
        }]
      }
      
    • Via S3 Bucket Policy:

      • Go to "hogwarts" S3 bucket and edit the bucket policy for Voldemort. For example:
        {
          "Version": "2012-10-17",
          "Id": "Policy1460741928081",
          "Statement": [
            {
              "Sid": "Stmt1460741905195",
              "Effect": "Allow",
              "Principal": {
                "AWS": [
                  "arn:aws:iam::Voldemort<this will be numbers IRL>:role/<Role Or User that EMR Cluster assumes>"
                ]
              },
              "Action": "s3:*",
              "Resource": "arn:aws:s3:::hogwarts/*"
            }
          ]
        }
      
  • Lastly, set the fs.s3a.acl.default Hadoop property (changing s3a to whichever S3 filesystem you are using: s3, s3n, or s3a). On an EMR cluster, there are a couple of ways to do this for Spark:
    • Set it in core-site.xml by adding a configuration file and referencing it during the creation of your cluster. The content of the configuration file will look like this:
      [{
        "Classification": "core-site",
        "Properties": {
          "fs.s3a.acl.default": "BucketOwnerFullControl"
        }
      }]
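      If you create the cluster through the AWS CLI, this JSON typically goes into a file that you pass to aws emr create-cluster via the --configurations option.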
    
    • Set it in the Spark Hadoop configuration (again, changing s3a to the S3 filesystem you are using: s3, s3n, or s3a).
      sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")
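
Either way, once the property is set, every object Spark writes to the bucket carries the bucket-owner-full-control ACL. A quick usage sketch (the DataFrame and output path below are made up):

// Assumes an existing SparkSession (sparkSession) and a DataFrame (df) to write out.
sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.acl.default", "BucketOwnerFullControl")

// Dumbledore, the bucket owner, retains full control of every object written below.
df.write.parquet("s3a://hogwarts/output/")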

If you are using the AWS Java SDK, you can set the x-amz-acl header on the PUT request to get the same behavior:

import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import com.amazonaws.services.s3.model.{ObjectMetadata, PutObjectRequest}

val fileContentBytes = fileContent.getBytes(StandardCharsets.UTF_8)
val fileInputStream = new ByteArrayInputStream(fileContentBytes)
val metadata = new ObjectMetadata()
metadata.setContentType(ContentType)
metadata.setContentLength(fileContentBytes.length)
// Grants the bucket owner (Dumbledore) full control over the uploaded object
metadata.setHeader("x-amz-acl", "bucket-owner-full-control")
val putObjectRequest = new PutObjectRequest(bucket, fileName, fileInputStream, metadata)
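
The snippet above only builds the request; it still has to be submitted with an AmazonS3 client. A minimal sketch, assuming a default client (use whatever client your application already has):

import com.amazonaws.services.s3.AmazonS3ClientBuilder

// The x-amz-acl header set on the metadata travels with this PUT request.
val s3Client = AmazonS3ClientBuilder.defaultClient()
s3Client.putObject(putObjectRequest)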

Important Note

The above settings will not have any impact on historical data. If existing objects were created without the bucket-owner-full-control ACL (or read permission for the bucket owner), you will still get Access Denied on them. To fix that, you will need to set the ACL on those individual objects, for example with s3api put-object-acl as mentioned above.