Posted to user@spark.apache.org by Benjamin Kim <bb...@gmail.com> on 2017/02/23 18:23:42 UTC

Get S3 Parquet File

We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
	at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
	at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
	at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
	at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
	at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
	at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
	at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
	at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
	at org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:229)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Can anyone help?

Cheers,
Ben


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Get S3 Parquet File

Posted by Benjamin Kim <bb...@gmail.com>.
Gourav,

I’ll start experimenting with Spark 2.1 to see if this works.

Cheers,
Ben


> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta <go...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> First of all, fetching data from S3 while running code on an on-premise system is a very bad idea. You might want to first copy the data into local HDFS before running your code. Of course, this depends on the volume of data and the internet speed that you have.
> 
> The platform that makes your data processing at least 10 times faster is Spark 2.1. And trust me, you do not want to be writing code that you will need to update again in six months because newer versions of Spark have deprecated it.
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bbuild11@gmail.com> wrote:
> [...]


Re: Get S3 Parquet File

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Benjamin,

First of all, fetching data from S3 while running code on an on-premise system
is a very bad idea. You might want to first copy the data into local HDFS
before running your code. Of course, this depends on the volume of data and
the internet speed that you have.

The platform that makes your data processing at least 10 times faster is Spark
2.1. And trust me, you do not want to be writing code that you will need to
update again in six months because newer versions of Spark have deprecated it.


Regards,
Gourav Sengupta



On Fri, Feb 24, 2017 at 7:18 AM, Benjamin Kim <bb...@gmail.com> wrote:

> [...]

Re: Get S3 Parquet File

Posted by Femi Anthony <fe...@gmail.com>.
Ok, thanks a lot for the heads up.

Sent from my iPhone

> On Feb 25, 2017, at 10:58 AM, Steve Loughran <st...@hortonworks.com> wrote:
> 
> [...]

Re: Get S3 Parquet File

Posted by Steve Loughran <st...@hortonworks.com>.
On 24 Feb 2017, at 07:47, Femi Anthony <fe...@gmail.com> wrote:

Have you tried reading using s3n, which is a slightly older protocol? I'm not sure how compatible s3a is with older versions of Spark.

I would absolutely not use s3n with a 1.3 GB file.

There is a WONTFIX JIRA on how it will read to the end of a file when you close a stream, and since seek() closes a stream, every seek will read to the end of the file. And since readFully(position, bytes) does a seek at either end, every time the Parquet code tries to read a bit of data it triggers up to 1.3 GB of download: https://issues.apache.org/jira/browse/HADOOP-12376

That is not going to be fixed, ever, because it can only be done by upgrading the libraries, and that will simply move new bugs in, lead to different bug reports, etc, etc. All for a piece of code which has been supplanted in the hadoop-2.7.x JARs with s3a, ready for use, and in the forthcoming hadoop-2.8+ code significantly faster for IO (especially ORC/Parquet), multi-GB uploads, and even the basic metadata operations used when setting up queries.
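As a rough sketch of the access pattern being described (the path below is a placeholder, and it assumes the s3n filesystem classes are on the classpath), Parquet starts by reading the 8-byte tail of the file and then the footer, and each positioned read is a seek:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder object; only meant to illustrate the read pattern.
val path = new Path("s3n://some-bucket/some-file.parquet")
val fs = path.getFileSystem(new Configuration())
val len = fs.getFileStatus(path).getLen

val in = fs.open(path)
// Parquet reads the last 8 bytes (footer length + magic), then the footer,
// then individual column chunks. Each readFully is a positioned read, i.e. a
// seek, and with s3n each seek drains the open HTTP stream to the end of the
// object.
val tail = new Array[Byte](8)
in.readFully(len - 8, tail)
in.close()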

For Hadoop 2.7+, use s3a. Any issues with s3n will be closed as "use s3a".




Femi

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bb...@gmail.com> wrote:
Hi Gourav,

My answers are below.

Cheers,
Ben


On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <go...@gmail.com> wrote:

Can I ask where you are running your CDH? Is it on premise or have you created a cluster for yourself in AWS? Our cluster is on premise in our data center.


you need to set up your s3a credentials in core-site, spark-defaults, or rely on spark-submit picking up the submitter's AWS env vars and propagating them.
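A minimal sketch of that in Spark 1.6 Scala, assuming the CDH-bundled hadoop-aws and AWS SDK jars are on the classpath; the app name, bucket path and environment variable names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// spark.hadoop.* properties are copied into the Hadoop Configuration that
// Spark builds, so the keys are available wherever the S3A filesystem is
// initialized, including on the executors.
val conf = new SparkConf()
  .setAppName("s3a-parquet-read")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("s3a://some-bucket/some-file.parquet")
println(df.count())

The same two properties can equivalently go into spark-defaults.conf, or be passed with --conf on the spark-submit command line.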


Also, I have really never seen s3a used before; that was used way back when writing S3 files took a long time, but I think that you are reading it.

Any idea why you are not migrating to Spark 2.1? Besides speed, there are lots of APIs which are new, and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the Spark community right now. We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.

This is in the Hadoop codebase, not the Spark release; it will be the same irrespective of the Spark version.


And besides that, would you not want to work on a platform which is at least 10 times faster? What would that be?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bb...@gmail.com> wrote:
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
        [...]



This stack trace implies that it is an executor failing to authenticate with AWS and so failing to read the bucket data. What may be happening is that code running in your client is being authenticated, but the work done on the RDD/DataFrame by the executors isn't.


1. try cranking up the logging in org.apache.hadoop.fs.s3a and com.cloudera.com.amazonaws (a driver-side sketch is below), though all the auth code there deliberately avoids printing out credentials, so it isn't that great for debugging things.
2. make sure that the fs.s3a secret and access keys are getting down to the executors.
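
For the first point, a driver-side sketch (the logger names follow the shaded package names in the stack trace; executors need the same categories raised in their own log4j.properties):

import org.apache.log4j.{Level, Logger}

// Raise the log level for the S3A filesystem and the shaded AWS SDK client.
Logger.getLogger("org.apache.hadoop.fs.s3a").setLevel(Level.DEBUG)
Logger.getLogger("com.cloudera.com.amazonaws").setLevel(Level.DEBUG)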


For troubleshooting S3A, start with
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

and/or
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html




Re: Get S3 Parquet File

Posted by Femi Anthony <fe...@gmail.com>.
Have you tried reading using s3n, which is a slightly older protocol? I'm not
sure how compatible s3a is with older versions of Spark.


Femi

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim <bb...@gmail.com> wrote:

> [...]


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.

Re: Get S3 Parquet File

Posted by Benjamin Kim <bb...@gmail.com>.
Hi Gourav,

My answers are below.

Cheers,
Ben


> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <go...@gmail.com> wrote:
> 
> Can I ask where you are running your CDH? Is it on premise or have you created a cluster for yourself in AWS? Our cluster is on premise in our data center.
> 
> Also, I have really never seen s3a used before; that was used way back when writing S3 files took a long time, but I think that you are reading it.
> 
> Any idea why you are not migrating to Spark 2.1? Besides speed, there are lots of APIs which are new, and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the Spark community right now. We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.
> 
> And besides that, would you not want to work on a platform which is at least 10 times faster? What would that be?
> 
> Regards,
> Gourav Sengupta
> 
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuild11@gmail.com> wrote:
> [...]
> 


Re: Get S3 Parquet File

Posted by Gourav Sengupta <go...@gmail.com>.
Can I ask where you are running your CDH? Is it on premise or have you
created a cluster for yourself in AWS?

Also, I have really never seen s3a used before; that was used way back when
writing S3 files took a long time, but I think that you are reading it.

Any idea why you are not migrating to Spark 2.1? Besides speed, there are
lots of APIs which are new, and the existing ones are being deprecated.
Therefore there is a very high chance that you are already working on code
which is being deprecated by the Spark community right now.

And besides that, would you not want to work on a platform which is at least
10 times faster?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bb...@gmail.com> wrote:

> [...]

Re: Get S3 Parquet File

Posted by Benjamin Kim <bb...@gmail.com>.
Aakash,

Here is a code snippet for the keys.

val accessKey = "---"
val secretKey = "---"

// Set the S3A credentials on the driver's Hadoop configuration.
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)
hadoopConf.set("spark.hadoop.fs.s3a.access.key", accessKey)
hadoopConf.set("spark.hadoop.fs.s3a.secret.key", secretKey)

// Reading the schema and showing rows works; the count is where it fails.
val df = sqlContext.read.parquet("s3a://aps.optus/uc2/BI_URL_DATA_HLY_20170201_09.PARQUET.gz")
df.show
df.count

When we do the count, then the error happens.

Thanks,
Ben


> On Feb 23, 2017, at 10:31 AM, Aakash Basu <aa...@gmail.com> wrote:
> 
> Hey,
> 
> Please recheck the access key and secret key being used to fetch the parquet file. It seems to be a credential error, either a mismatch or a loading problem. If it is a loading problem, first use the credentials directly in the code and see if the issue resolves; then they can be hidden and read from input parameters.
> 
> Thanks,
> Aakash.
> 
> 
> On 23-Feb-2017 11:54 PM, "Benjamin Kim" <bbuild11@gmail.com> wrote:
> [...]


Re: Get S3 Parquet File

Posted by Aakash Basu <aa...@gmail.com>.
Hey,

Please recheck the access key and secret key being used to fetch the parquet
file. It seems to be a credential error, either a mismatch or a loading
problem. If it is a loading problem, first use the credentials directly in the
code and see if the issue resolves; then they can be hidden and read from
input parameters.
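
A sketch of that second option, assuming the keys are exported as environment variables (the variable names below are placeholders) instead of being hard-coded:

// Fail fast if the expected environment variables are missing.
val accessKey = sys.env.getOrElse("AWS_ACCESS_KEY_ID",
  sys.error("AWS_ACCESS_KEY_ID is not set"))
val secretKey = sys.env.getOrElse("AWS_SECRET_ACCESS_KEY",
  sys.error("AWS_SECRET_ACCESS_KEY is not set"))

// Same wiring as the snippet earlier in the thread, but nothing sensitive
// lives in the source code.
sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)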

Thanks,
Aakash.


On 23-Feb-2017 11:54 PM, "Benjamin Kim" <bb...@gmail.com> wrote:

[...]