Posted to dev@spark.apache.org by Amin Borjian <bo...@outlook.com> on 2022/04/09 04:36:33 UTC

Spark client for Hadoop 2.x

From Spark version 3.1.0 onwards, the clients provided for Spark are built with Hadoop 3 and placed in the Maven repository. Unfortunately, we currently use Hadoop 2.7.7 in our infrastructure.

1) Does Spark have a plan to publish the Spark client dependencies for Hadoop 2.x?
2) Are the new Spark clients capable of connecting to a Hadoop 2.x cluster? (According to a simple test, Spark client 3.2.1 had no problem with the Hadoop 2.7 cluster, but we wanted to know if there is any guarantee from Spark.)
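
For context, this is roughly how we depend on the Spark client today (a
minimal sbt sketch; the commented-out hadoop-client override is our own
experiment, not something Spark publishes):

    // build.sbt (sketch): depending on the Spark 3.2.1 client artifacts.
    // These transitively pull the Hadoop 3.3.x shaded client jars
    // (hadoop-client-api / hadoop-client-runtime).
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "3.2.1"
      // Hypothetical: forcing the old unshaded client would mean excluding
      // the Hadoop 3 jars and adding
      //   "org.apache.hadoop" % "hadoop-client" % "2.7.7"
    )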

Thank you very much in advance
Amin Borjian


Re: Spark client for Hadoop 2.x

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
I should back up Dongjoon's comments with the observation that hadoop 2.10.x
is the only branch-2 release which gets any security updates; on branch-3 it
is 3.2.x and 3.3.x which do. Dongjoon's colleague Chao Sun was the release
manager on the 3.3.2 release, so it got thoroughly tested with Spark. (I'm
doing a minor update of that this week, including a fix for a minor CVE.)

If you are still running hadoop-2.7.7 in production, know that you are
running a version that is about six years old and you should really think
of an upgrade. In fact, I don't think it was ever tested against java8, as
it came out before that work.
https://issues.apache.org/jira/browse/HADOOP-11090
If you have any problems with hadoop 2.7.7 security-wise, compatibility
etc., then the response of the team to any issue filed on the ASF JIRA is
"if you can replicate this on hadoop-3.2+ we will consider it a real issue".

I'm also going to observe that the hadoop APIs have changed and libraries
make use of them for better performance.
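
As one illustrative sketch (the path is made up): the openFile() builder
that recent Hadoop 3 releases added does not exist on branch-2, so code
like the following will not even compile against the hadoop 2.x client:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // openFile() returns a builder whose build() yields a
    // CompletableFuture[FSDataInputStream]; Hadoop 3 only.
    val path = new Path("hdfs:///tmp/example.txt")  // hypothetical path
    val fs   = FileSystem.get(path.toUri, new Configuration())
    val in   = fs.openFile(path).build().get()
    try {
      val header = new Array[Byte](16)
      in.readFully(0L, header)  // positioned read from the start
    } finally in.close()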

If "According to a simple test, Spark client 3.2.1 had no problem with the
Hadoop 2.7 cluster" and then it is possible that your test didn't do much
in the way of coverage, especially of external libraries which you may
depend on, parquet, avro etc. " and new stuff like Apache iceberg, is
unlikely to even compile against branch-2, let alone run.

"if there was any guarantee from Spark"

Define guarantee here? It's not been tested and, again, any issues filed
will be ignored unless they are replicable on a combination of artifacts
which the release was tested against. Even there, in the world of open
source, there is the hope/expectation that people encountering problems can
participate in fixing them.

If you really want a version of spark and the libraries which you depend on
to run on hadoop 2.7.7 (& java 7) then you are pretty much going to have to
rebuild the entire stack and retest it yourself. I wouldn't really recommend
this.
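
(If you did go down that route, it would mean rebuilding the spark
distribution yourself against the legacy profile, something like

    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.7 -DskipTests clean package

assuming the hadoop-2.7 profile still exists on the branch you build, and
then doing the same for every library above it.)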

-Steve


Re: Spark client for Hadoop 2.x

Posted by Dongjoon Hyun <do...@gmail.com>.
Hi, Amin

In general, the Apache Spark community has received a lot of feedback and
has been moving forward to:

- Use the latest Hadoop versions for more bug fixes, including CVE fixes.
- Use Hadoop's shaded clients to minimize dependency issues.

Since the above is not achievable with Hadoop 2 clients, I believe the
official answer to (1) is `No`. (Especially for your Hadoop 2.7 cluster;
2.7.7 was released back in 2018.)
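
(For reference, a sketch of the Hadoop 3 shaded client pair as it appears
as dependencies; the version number here is illustrative:

    // sbt sketch: the shaded client artifacts Spark 3.x builds against.
    // hadoop-client-api carries the public Hadoop classes;
    // hadoop-client-runtime carries the relocated third-party ones.
    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-client-api"     % "3.3.2",
      "org.apache.hadoop" % "hadoop-client-runtime" % "3.3.2"
    )

No equivalent shaded artifacts were ever published for Hadoop 2.x.)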

For the second question, the Apache Spark community has been collaborating
with the Apache Hadoop community in order to use the latest Apache Hadoop 3
clients to connect to old/new Hadoop clusters and public cloud environments.
I believe your production jobs should be fine if you are not relying on
some proprietary (i.e. non-Apache Hadoop) features from private vendors.
Please report to the Apache Hadoop community or us if you hit unknown
compatibility issues.
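
(A quick smoke test along those lines, in spark-shell against a
hypothetical old namenode, might look like:

    // Spark 3.x (Hadoop 3 client) reading from a Hadoop 2.x HDFS cluster.
    // "old-nn" is a placeholder for your Hadoop 2.7 namenode host.
    val df = spark.read.text("hdfs://old-nn:8020/tmp/sample.txt")
    df.show(5)

A failure here would be exactly the kind of compatibility issue worth
reporting.)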

Bests
Dongjoon.

