Posted to dev@spark.apache.org by Nicholas Chammas <ni...@gmail.com> on 2019/10/28 15:34:28 UTC

Spark 3.0 and S3A

Howdy folks,

I have a question about what is happening with the 3.0 release in relation
to Hadoop and hadoop-aws
<https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html>
.

Today, among other builds, we release a build of Spark built against Hadoop
2.7 and another one built without Hadoop. In Spark 3+, will we continue to
release Hadoop 2.7 builds as one of the primary downloads on the download
page <http://spark.apache.org/downloads.html>? Or will we start building
Spark against a newer version of Hadoop?

The reason I ask is that successive versions of hadoop-aws have made
significant usability improvements to S3A. To get those, users need to
download the Hadoop-free build of Spark
<https://spark.apache.org/docs/latest/hadoop-provided.html> and then link
Spark to a version of Hadoop newer than 2.7. There are various dependency
and runtime issues with trying to pair Spark built against Hadoop 2.7 with
hadoop-aws 2.8 or newer.
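
(For reference, the Hadoop-free builds are wired up to an external Hadoop
through conf/spark-env.sh, per the hadoop-provided docs linked above; a
minimal sketch, assuming the hadoop command is already on your PATH:)

    # conf/spark-env.sh: point the Hadoop-free Spark build at your own Hadoop
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)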

If we start releasing builds of Spark built against Hadoop 3.2 (or another
recent version), users can get the latest S3A improvements via --packages
"org.apache.hadoop:hadoop-aws:3.2.1" without needing to download Hadoop
separately.
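
To make that concrete, a sketch of what the workflow would look like (the
bucket name is made up, and the hadoop-aws version must match the Hadoop
version the Spark build ships with):

    # launch with the matching S3A connector pulled in at submit time
    spark-shell --packages org.apache.hadoop:hadoop-aws:3.2.1

    // then, in the shell, s3a:// paths work directly:
    spark.read.text("s3a://some-bucket/logs/").count()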

Nick

Re: Spark 3.0 and S3A

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
On Mon, Oct 28, 2019 at 3:40 PM Sean Owen <sr...@apache.org> wrote:

> There will be a "Hadoop 3.x" version of 3.0, as it's essential to get
> a JDK 11-compatible build; you can see the hadoop-3.2 profile.
> hadoop-aws is pulled in by the hadoop-cloud module, I believe, so it
> bears checking whether the profile updates the versions there too.
>

It does: you get hadoop-cloud-storage 3.2, which comes with a shaded AWS
SDK jar in sync with both the S3A code and spark-kinesis.
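
(For anyone managing the dependency themselves rather than taking the
Spark build's hadoop-cloud module, a minimal sbt sketch; the version here
is an assumption and should match whatever Hadoop you run against:)

    // build.sbt: the cloud-storage aggregate keeps the shaded AWS SDK in sync
    libraryDependencies += "org.apache.hadoop" % "hadoop-cloud-storage" % "3.2.1"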

Trying to use the Hadoop 2.7 version of the S3A connector is an exercise
in painful futility. It works, but it is, what, four years out of date?
Besides lacking all the later performance and scale improvements (random
IO reads in particular), it ships an out-of-date AWS SDK with an embedded
org.json module whose licence is now forbidden by the ASF (hence: no more
ASF releases of 2.7.x), and it doesn't really handle any of the new
v4-signature-only S3 regions.
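
(With a 2.8+ connector you can at least point S3A at such a region's
endpoint explicitly; a sketch, using eu-central-1 as an example of a
v4-signing region:)

    # route S3A straight to the region's endpoint
    spark-shell --conf spark.hadoop.fs.s3a.endpoint=s3.eu-central-1.amazonaws.com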

If you ever search for "spark + s3a", you will see that the first step to
talking to S3 with the ASF releases is getting your classpath right. Given
that the attempts generally consist of dropping in a new AWS SDK or a
hadoop-aws-3.1 JAR, the first question is invariably "why do I get some
class-not-found exception?"

As we say in the docs: randomly dropping in jars simply moves your stack
trace around.

https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/troubleshooting_s3a.md#-classpath-setup
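
(A quick sanity check before assuming a bug, sketched here for a bundled
Spark distribution: every hadoop-* jar on the classpath should report the
same version, and the AWS SDK should be the one that Hadoop release was
built with:)

    # list the Hadoop and AWS SDK jars Spark will actually load
    ls "$SPARK_HOME/jars" | grep -E 'hadoop-|aws-java-sdk'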



Re: Spark 3.0 and S3A

Posted by Sean Owen <sr...@apache.org>.
There will be a "Hadoop 3.x" version of 3.0, as it's essential to get
a JDK 11-compatible build; you can see the hadoop-3.2 profile.
hadoop-aws is pulled in by the hadoop-cloud module, I believe, so it
bears checking whether the profile updates the versions there too.
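
(For anyone who wants to try it ahead of the release, a sketch of building
from source with those profiles; the flags beyond the two profiles are
just the usual ones:)

    # build Spark against Hadoop 3.2 with the cloud connectors included
    ./build/mvn -Phadoop-3.2 -Phadoop-cloud -DskipTests clean package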

On Mon, Oct 28, 2019 at 10:34 AM Nicholas Chammas
<ni...@gmail.com> wrote:
> [...]
