Posted to user@spark.apache.org by Jerry Lam <ch...@gmail.com> on 2016/01/01 22:35:18 UTC
Re: SparkSQL integration issue with AWS S3a
Hi Kostiantyn,
You should be able to use a spark.conf file to specify the s3a keys.
I don't remember exactly, but you can set Hadoop properties by prefixing them with spark.hadoop.*, where * is the s3a property name. For instance:
spark.hadoop.fs.s3a.access.key wudjgdueyhsj
Of course, you need to make sure the property key is right; I'm on my phone, so I can't easily verify it.
Then you can specify different users with different spark.conf files, passed via --properties-file to spark-submit.
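To make this concrete, here is a minimal sketch of the per-user properties-file approach. The file path, the placeholder values, and the exact spark.hadoop.fs.s3a.* key names are assumptions to verify against your Hadoop version:

```shell
# Hypothetical per-user properties file (placeholder values, not real keys).
cat > /tmp/spark-user-a.conf <<'EOF'
spark.hadoop.fs.s3a.access.key PLACEHOLDER_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key PLACEHOLDER_SECRET_KEY
EOF

# Each user then submits with their own file, e.g.:
#   spark-submit --properties-file /tmp/spark-user-a.conf ...
grep -c '^spark.hadoop.fs.s3a' /tmp/spark-user-a.conf
```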
HTH,
Jerry
Sent from my iPhone
> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>
> Hi Jerry,
>
> what you suggested seems to be working (I put hdfs-site.xml into the $SPARK_HOME/conf folder), but could you shed some light on how it can be federated per user?
> Thanks in advance!
>
> Thank you,
> Konstantin Kudryavtsev
>
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam <ch...@gmail.com> wrote:
>> Hi Kostiantyn,
>>
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you could define different spark-{user-x}.conf files and source them during spark-submit. Let us know if hdfs-site.xml works first. It should.
>>
>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>>>
>>> Hi Jerry,
>>>
>>> I want to run different jobs on different S3 buckets - different AWS creds - on the same instances. Could you shed some light on whether this is possible to achieve with hdfs-site?
>>>
>>> Thank you,
>>> Konstantin Kudryavtsev
>>>
>>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam <ch...@gmail.com> wrote:
>>>> Hi Kostiantyn,
>>>>
>>>> Can you define those properties in hdfs-site.xml and make sure it is visible on the classpath when you spark-submit? It looks like a conf-sourcing issue to me.
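As a sketch of what such an hdfs-site.xml might contain (assuming the standard s3a property names; the values are placeholders, not real credentials):

```xml
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>PLACEHOLDER_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>PLACEHOLDER_SECRET_KEY</value>
  </property>
</configuration>
```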
>>>>
>>>> Cheers,
>>>>
>>>> Sent from my iPhone
>>>>
>>>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>>>>>
>>>>> Chris,
>>>>>
>>>>> thanks for the hint about IAM roles, but in my case I need to run different jobs with different S3 permissions on the same cluster, so this approach doesn't work for me, as far as I understood it
>>>>>
>>>>> Thank you,
>>>>> Konstantin Kudryavtsev
>>>>>
>>>>>> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>> couple things:
>>>>>>
>>>>>> 1) switch to IAM roles if at all possible - explicitly passing AWS credentials is a long and lonely road in the end
>>>>>>
>>>>>> 2) one really bad workaround/hack is to run a job that hits every worker and writes the credentials to the proper location (~/.awscredentials or whatever)
>>>>>>
>>>>>> ^^ I wouldn't recommend this. ^^ It's horrible and doesn't handle autoscaling, but I'm mentioning it anyway as a temporary fix.
>>>>>>
>>>>>> if you switch to IAM roles, things become a lot easier: you can authorize all of the EC2 instances in the cluster, it handles autoscaling very well, and at some point you will want to autoscale.
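For illustration, an instance-profile policy granting read access to a single bucket might look like the sketch below (the bucket name and the exact action list are placeholders to adapt):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-bucket",
        "arn:aws:s3:::example-bucket/*"
      ]
    }
  ]
}
```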
>>>>>>
>>>>>>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>>>>>>> Chris,
>>>>>>>
>>>>>>> good question; as you can see from the code, I set them up on the driver, so I expect they will be propagated to all nodes, won't they?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Konstantin Kudryavtsev
>>>>>>>
>>>>>>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly <ch...@fregly.com> wrote:
>>>>>>>> are the credentials visible from each Worker node to all the Executor JVMs on each Worker?
>>>>>>>>
>>>>>>>>> On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Dear Spark community,
>>>>>>>>>
>>>>>>>>> I faced the following issue when trying to access data on S3a; my code is the following:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.{SparkConf, SparkContext}
>>>>>>>>> import org.apache.spark.sql.SQLContext
>>>>>>>>>
>>>>>>>>> val sparkConf = new SparkConf()
>>>>>>>>> val sc = new SparkContext(sparkConf)
>>>>>>>>> // point the s3a scheme at the S3A filesystem and pass credentials explicitly
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
>>>>>>>>> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
>>>>>>>>> val sqlContext = SQLContext.getOrCreate(sc)
>>>>>>>>> val df = sqlContext.read.parquet(...)
>>>>>>>>> df.count
>>>>>>>>>
>>>>>>>>> It results in the following exception and log messages:
>>>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from BasicAWSCredentialsProvider: Access key or secret key is null
>>>>>>>>> 15/12/30 17:00:32 DEBUG EC2MetadataClient: Connecting to EC2 instance metadata service at URL: http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>>>> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load credentials from InstanceProfileCredentialsProvider: The requested metadata is not found at http://x.x.x.x/latest/meta-data/iam/security-credentials/
>>>>>>>>> 15/12/30 17:00:32 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 3)
>>>>>>>>> com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
>>>>>>>>> at com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
>>>>>>>>> at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3521)
>>>>>>>>> at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1031)
>>>>>>>>> at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:994)
>>>>>>>>> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:297)
>>>>>>>>>
>>>>>>>>> I run standalone Spark 1.5.2 with Hadoop 2.7.1
>>>>>>>>>
>>>>>>>>> any ideas/workarounds?
>>>>>>>>>
>>>>>>>>> AWS credentials are correct for this bucket
>>>>>>>>>
>>>>>>>>> Thank you,
>>>>>>>>> Konstantin Kudryavtsev
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Chris Fregly
>>>>>> Principal Data Solutions Engineer
>>>>>> IBM Spark Technology Center, San Francisco, CA
>>>>>> http://spark.tc | http://advancedspark.com
>
Re: SparkSQL integration issue with AWS S3a
Posted by Jerry Lam <ch...@gmail.com>.
Hi Kostiantyn,
Yes. If security is a concern, then this approach cannot satisfy it: the keys are visible in the properties files. If the goal is to hide them, you might be able to go a bit further with this approach. Have you looked at the Spark security page?
Best Regards,
Jerry
Sent from my iPhone
> On 6 Jan, 2016, at 8:49 am, Kostiantyn Kudriavtsev <ku...@gmail.com> wrote:
>
> Hi guys,
>
> the one big issue with this approach:
>>> spark.hadoop.fs.s3a.access.key is now visible everywhere - in logs, in the Spark web UI - and is not secured at all...
>
>> On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
>>
>> thanks Jerry, it works!
>> really appreciate your help
>>
>> Thank you,
>> Konstantin Kudryavtsev
>>
Re: SparkSQL integration issue with AWS S3a
Posted by Kostiantyn Kudriavtsev <ku...@gmail.com>.
Hi guys,
the one big issue with this approach:
> spark.hadoop.fs.s3a.access.key is now visible everywhere - in logs, in the Spark web UI - and is not secured at all...
On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev <ku...@gmail.com> wrote:
> thanks Jerry, it works!
> really appreciate your help
>
> Thank you,
> Konstantin Kudryavtsev
>
Re: SparkSQL integration issue with AWS S3a
Posted by KOSTIANTYN Kudriavtsev <ku...@gmail.com>.
thanks Jerry, it works!
I really appreciate your help.
Thank you,
Konstantin Kudryavtsev