Posted to user@spark.apache.org by Adam Gilmore <dr...@gmail.com> on 2014/12/22 02:37:47 UTC

Issue with Parquet on Spark 1.2 and Amazon EMR

Hi all,

I've just launched a new Amazon EMR cluster and used the script at:

s3://support.elasticmapreduce/spark/install-spark

to install Spark (this script was upgraded to support 1.2).

I know there are tools to launch a Spark cluster in EC2, but I want to use
EMR.

Everything installs fine; however, when I go to read from a Parquet file, I
end up with the following (the main part of the exception):

Caused by: java.lang.NoSuchMethodError: parquet.hadoop.ParquetInputSplit.<init>(Lorg/apache/hadoop/fs/Path;JJJ[Ljava/lang/String;[JLjava/lang/String;Ljava/util/Map;)V
        at parquet.hadoop.TaskSideMetadataSplitStrategy.generateTaskSideMDSplits(ParquetInputFormat.java:578)
        ... 55 more

It seems to me like a version mismatch somewhere.  Where is the
parquet-hadoop jar coming from?  Is it built into a fat jar for Spark?

Any help would be appreciated.  Note that 1.1.1 worked fine with Parquet
files.
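
For reference, a minimal sketch of the kind of read that hits this (the S3 path is made up; any Parquet input on the cluster reproduces it):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-repro"))
    val sqlContext = new SQLContext(sc)

    // parquetFile returns a SchemaRDD in Spark 1.2; the NoSuchMethodError
    // surfaces when the first action computes the input splits, not when
    // the file is opened.
    val events = sqlContext.parquetFile("s3://my-bucket/events.parquet")
    println(events.count())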

RE: Issue with Parquet on Spark 1.2 and Amazon EMR

Posted by "Bozeman, Christopher" <bo...@amazon.com>.
Thanks to Aniket’s work, there are two new options for the EMR install script for Spark; see https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/README.md. The “-a” option can be used to bump the spark-assembly to the front of the classpath.
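
As a quick sanity check after relaunching with “-a” (a sketch, not from the README): in spark-shell, print the JVM classpath and confirm the spark-assembly jar now appears ahead of the EMR-provided parquet jars:

    scala> // spark-assembly-*.jar should now precede any parquet-*.jar entries
    scala> println(System.getProperty("java.class.path"))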

-Christopher


Re: Issue with Parquet on Spark 1.2 and Amazon EMR

Posted by Aniket Bhatnagar <an...@gmail.com>.
Meanwhile, I have submitted a pull request (https://github.com/awslabs/emr-bootstrap-actions/pull/37) that allows users to place their jars ahead of all other jars in the Spark classpath. This should serve as a temporary workaround for all class conflicts.
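
Until that is merged, Spark's own classpath settings approximate the same thing per application. A sketch (jar path and version are illustrative only; the driver's classpath has to be set at launch instead, e.g. via spark-submit --driver-class-path):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: point executors at a known-good parquet-hadoop
    // jar so it wins over the copy shipped in the EMR libs.
    val conf = new SparkConf()
      .setAppName("parquet-classpath-workaround")
      .set("spark.executor.extraClassPath",
        "/home/hadoop/extra/parquet-hadoop-1.6.0rc3.jar")
    val sc = new SparkContext(conf)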

Thanks,
Aniket


Re: Issue with Parquet on Spark 1.2 and Amazon EMR

Posted by "Kelly, Jonathan" <jo...@amazon.com>.
I've noticed the same thing recently and will contact the appropriate owner soon.  (I work for Amazon, so I'll go through internal channels and report back to this list.)

In the meantime, I've found that editing spark-env.sh and putting the Spark assembly first in the classpath fixes the issue.  I expect that the version of Parquet that's being included in the EMR libs just needs to be upgraded.
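
One way to confirm which copy wins (a spark-shell sketch, using the class from the stack trace):

    scala> // Prints the jar that ParquetInputSplit resolves from; after the
    scala> // spark-env.sh change it should be the Spark assembly rather than
    scala> // the EMR lib directory.
    scala> println(Class.forName("parquet.hadoop.ParquetInputSplit").getProtectionDomain.getCodeSource.getLocation)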

~ Jonathan Kelly


Re: Issue with Parquet on Spark 1.2 and Amazon EMR

Posted by Aniket Bhatnagar <an...@gmail.com>.
Can you confirm your EMR version? Could it be because of the classpath
entries for EMRFS? You might face issues using S3 without them.
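
For concreteness, a sketch (from spark-shell, where sc is predefined; bucket and path are made up) of the kind of S3 read that depends on those entries:

    // s3:// URIs resolve through EMRFS on EMR clusters, so dropping its
    // classpath entries would make reads like this start failing.
    val lines = sc.textFile("s3://my-bucket/some/input")
    println(lines.count())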

Thanks,
Aniket


Re: Issue with Parquet on Spark 1.2 and Amazon EMR

Posted by Adam Gilmore <dr...@gmail.com>.
Just an update on this - I found that the script by Amazon was the culprit,
though I'm not exactly sure why.  When I installed Spark manually onto the
EMR cluster (and did the manual configuration of all the EMR stuff), it
worked fine.
