Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2018/06/26 13:52:00 UTC

[jira] [Comment Edited] (HADOOP-15559) Clarity on Spark compatibility with hadoop-aws

    [ https://issues.apache.org/jira/browse/HADOOP-15559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523741#comment-16523741 ] 

Steve Loughran edited comment on HADOOP-15559 at 6/26/18 1:51 PM:
------------------------------------------------------------------

# We feel your pain. Getting everything synced up is hard, especially when Spark itself bumps some dependencies incompatibly (SPARK-22919).
 # The latest docs on this topic [are here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md]; as they say:

{quote}Critical: Do not attempt to "drop in" a newer version of the AWS SDK than that which the Hadoop version was built with. Whatever problem you have, changing the AWS SDK version will not fix things, only change the stack traces you see.
{quote}
{quote}Similarly, don't try and mix a hadoop-aws JAR from one Hadoop release with that of any other. The JAR must be in sync with hadoop-common and some other Hadoop JARs.
{quote}
{quote}Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.
{quote}

We also point people at [mvnrepo|http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws] for the normative hadoop-aws to AWS SDK JAR mapping. Getting a mismatched AWS SDK and Hadoop AWS binding to work together is not easy, and if you do try it, you are on your own, with nothing but the "upgrade AWS SDK" JIRAs to act as a cue.
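
If you want to check that mapping locally rather than on the website, something like the following works as a rough sketch (requires Maven on the path; the 2.8.4 version is purely an example):
{code:bash}
# Rough sketch: pull the hadoop-aws POM into the local repository, then ask Maven
# which com.amazonaws artifacts that exact release declares. 2.8.4 is only an example.
mvn dependency:get -Dartifact=org.apache.hadoop:hadoop-aws:2.8.4
mvn -f ~/.m2/repository/org/apache/hadoop/hadoop-aws/2.8.4/hadoop-aws-2.8.4.pom \
    dependency:tree -Dincludes=com.amazonaws
{code}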

Where life gets hard is that unless you build Spark with the -Phadoop-cloud profile, you don't get everything lined up for you. That is what the profile is for.
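
For reference, building your own distribution with that profile enabled looks roughly like this (flag and profile names follow the Spark 2.3 build docs; verify them for the release you are building):
{code:bash}
# Sketch: build a Spark distribution with the hadoop-cloud profile, so hadoop-aws,
# the matching AWS SDK and the Jackson JARs are all chosen for you by the build.
./dev/make-distribution.sh --name with-cloud --tgz -Phadoop-2.7 -Pyarn -Phadoop-cloud
{code}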

Regarding your specific issue: unless you are using a Spark release built with that cloud profile, you have to line things up by hand (a rough sketch follows this list).
* Get a hadoop-aws JAR whose version exactly matches your hadoop-common. The same goes for hadoop-auth. You cannot mix versions across these JARs.
* Get the matching AWS SDK JAR(s), using mvnrepository as your guide.
* Get matching Jackson JARs too, obviously. FWIW, Hadoop 2.9+ has moved to the shaded AWS SDK bundle JAR to avoid a lot of this pain.
* If you want to use Hadoop 2.8, and your Spark distribution has not reverted SPARK-22919, downgrade the httpclient libraries (see the PR there for what changed).
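
Concretely, for a stock Spark release built against Hadoop 2.7, the hand-rolled route looks roughly like this; the coordinates are one example of a matched set, not a recommendation, so confirm the versions on mvnrepository first:
{code:bash}
# Matched set for a Spark build on the Hadoop 2.7 line. hadoop-aws 2.7.x pulls in the
# AWS SDK it was compiled against (aws-java-sdk 1.7.4) transitively, so don't pin a
# different SDK version on top of it. Versions here are illustrative only.
pyspark --packages "org.apache.hadoop:hadoop-aws:2.7.6"
{code}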



Returning to your complaint: anything else you can do for the docs is welcome, though really, the best strategy would be to get Spark releases built with that hadoop-cloud profile, which is intended to give you all the dependencies you need, and none of the ones you don't.



> Clarity on Spark compatibility with hadoop-aws
> ----------------------------------------------
>
>                 Key: HADOOP-15559
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15559
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: documentation, fs/s3
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> I'm the maintainer of [Flintrock|https://github.com/nchammas/flintrock], a command-line tool for launching Apache Spark clusters on AWS. One of the things I try to do for my users is make it straightforward to use Spark with {{s3a://}}. I do this by recommending that users start Spark with the {{hadoop-aws}} package.
> For example:
> {code:java}
> pyspark --packages "org.apache.hadoop:hadoop-aws:2.8.4"
> {code}
> I'm struggling, however, to understand what versions of {{hadoop-aws}} should work with what versions of Spark.
> Spark releases are [built against Hadoop 2.7|http://archive.apache.org/dist/spark/spark-2.3.1/]. At the same time, I've been told that I should be able to use newer versions of Hadoop and Hadoop libraries with Spark, so for example, running Spark built against Hadoop 2.7 alongside HDFS 2.8 should work, and there is [no need to build Spark explicitly against Hadoop 2.8|http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Spark-2-3-1-RC4-tp24087p24092.html].
> I'm having trouble translating this mental model into recommendations for how to pair Spark with {{hadoop-aws}}.
> For example, Spark 2.3.1 built against Hadoop 2.7 works with {{hadoop-aws:2.7.6}} but not with {{hadoop-aws:2.8.4}}. Trying the latter yields the following error when I try to access files via {{s3a://}}.
> {code:java}
> py4j.protocol.Py4JJavaError: An error occurred while calling o35.text.
> : java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
> at org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
> at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:693)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:282)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:238)
> at java.lang.Thread.run(Thread.java:748){code}
> So it would seem that {{hadoop-aws}} must be matched to the same MAJOR.MINOR release of Hadoop that Spark is built against. However, neither [this page|https://wiki.apache.org/hadoop/AmazonS3] nor [this one|https://hortonworks.github.io/hdp-aws/s3-spark/] shed any light on how to pair the correct version of {{hadoop-aws}} with Spark.
> Would it be appropriate to add some guidance somewhere on what versions of {{hadoop-aws}} work with what versions and builds of Spark? It would help eliminate this kind of guesswork and slow spelunking.


