Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2022/09/06 13:29:33 UTC

[GitHub] [hadoop] einavh opened a new pull request, #4861: Branch 3.2 - hadoop per bucket endpoint configuration is ignored

einavh opened a new pull request, #4861:
URL: https://github.com/apache/hadoop/pull/4861

   
   I'm running an EMR emr-6.5.0 cluster in us-east-1 on EC2 instances. The cluster runs a Spark application using PySpark 3.2.1.
   EMR uses the Hadoop distribution Amazon 3.2.1.
   
   My Spark application reads from one bucket in us-west-2 and writes to a bucket in us-east-1.
   
   Since I'm processing a large amount of data, I'm paying a lot for network transport. To reduce the cost, I created a VPC interface endpoint for S3 in us-west-2. Inside the Spark application I use the AWS CLI to list the file names in the us-west-2 bucket, and that traffic does go through the S3 interface endpoint. However, when I use PySpark to read the data, it uses the us-east-1 S3 endpoint instead of the us-west-2 endpoint.
   I tried to use per-bucket configuration, but it is ignored even though I added it both to the default configuration and to the spark-submit call.
   
   I tried to set the following configuration options, but they are ignored:
     '--conf', "spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain",
     '--conf', "spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<us-west-2 endpoint>",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-west-2-bucket-name>.endpoint.region=us-west-2",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<us-east-1 endpoint>",
     '--conf', "spark.hadoop.fs.s3a.bucket.<us-east-1-bucket-name>.endpoint.region=us-east-1",
     '--conf', "spark.hadoop.fs.s3a.path.style.access=false",
     '--conf', "spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
     '--conf', "spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true",
     '--conf', "Dfs.s3a.bucket.<us-east-1-bucket-name>.endpoint=<us-east-1 endpoint>",
     '--conf', "Dfs.s3a.bucket.<us-west-2-bucket-name>.endpoint=<us-west-2 endpoint>",
     '--conf', "spark.eventLog.enabled=false",
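   Editor's note: a minimal sketch of how S3A per-bucket endpoint settings are usually passed through spark-submit. Spark only copies options prefixed with "spark.hadoop." into the Hadoop Configuration, so the bare "Dfs.s3a..." keys above would not reach S3A at all. The bucket names and endpoint hosts below are placeholders, not values from this report, and note that fs.s3a.endpoint.region was only introduced in Hadoop 3.3.x, so it may be ignored by a 3.2.x distribution.

   # Sketch, with hypothetical bucket names and the public regional S3 endpoints.
   # Only "spark.hadoop.*" options are copied into the Hadoop Configuration;
   # a "--conf Dfs.s3a..." option is silently dropped by Spark.
   spark-submit \
     --conf spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint=s3.us-west-2.amazonaws.com \
     --conf spark.hadoop.fs.s3a.bucket.my-west-bucket.endpoint.region=us-west-2 \
     --conf spark.hadoop.fs.s3a.bucket.my-east-bucket.endpoint=s3.us-east-1.amazonaws.com \
     --conf spark.hadoop.fs.s3a.bucket.my-east-bucket.endpoint.region=us-east-1 \
     my_app.py

   The same keys (without the spark.hadoop. prefix) can instead be placed in core-site.xml, which is what "default configuration" refers to above.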
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org


Re: [PR] Branch 3.2 - hadoop per bucket endpoint configuration is ignored [hadoop]

Posted by "ayushtkn (via GitHub)" <gi...@apache.org>.
ayushtkn closed pull request #4861: Branch 3.2 - hadoop per bucket endpoint configuration is ignored
URL: https://github.com/apache/hadoop/pull/4861

