Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/11/19 23:14:44 UTC

Dropping support for earlier Hadoop versions in Spark 2.0?

I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
think everybody is for that.

https://issues.apache.org/jira/browse/SPARK-11807

Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
to say, keep only Hadoop 2.6 and greater.

What are the community's thoughts on that?

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Chester Chen <ch...@alpinenow.com>.
For #1-3, the answer is likely no.

  Recently we upgraded to Spark 1.5.1, with CDH5.3, CDH5.4, HDP2.2, and
others.

  We were using the CDH5.3 client to talk to CDH5.4, to see whether we
could support many different Hadoop cluster versions without changing the
build. This was OK in yarn-cluster mode on Spark 1.3.1, but we could not
get Spark 1.5.1 started. Once we upgraded the client to CDH5.4, everything
worked.

  There are API changes between Apache Hadoop 2.4 and 2.6; I'm not sure
you can mix and match them.
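
For illustration, a minimal sketch of the kind of client-version probe
involved; it assumes hadoop-common on the classpath, and the object name is
made up:

    import org.apache.hadoop.util.VersionInfo

    object HadoopClientVersionCheck {
      def main(args: Array[String]): Unit = {
        // VersionInfo reports the Hadoop version the client JARs were built
        // from, e.g. "2.6.0" -- useful when mixing client/cluster versions.
        val Array(major, minor) =
          VersionInfo.getVersion.split("\\.").take(2).map(_.toInt)
        val atLeast26 = major > 2 || (major == 2 && minor >= 6)
        println(s"Hadoop client ${VersionInfo.getVersion}; >= 2.6: $atLeast26")
      }
    }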

Chester


On Fri, Nov 20, 2015 at 1:59 PM, Sandy Ryza <sa...@cloudera.com> wrote:

> To answer your fourth question from Cloudera's perspective, we would never
> support a customer running Spark 2.0 on a Hadoop version < 2.6.
>
> -Sandy
>
> On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> OK I'm not exactly asking for a vote here :)
>>
>> I don't think we should look at it from only a maintenance point of view --
>> because in that case the answer is clearly supporting as few versions as
>> possible (or just rm -rf the spark source code and call it a day). It is a
>> tradeoff between the number of users impacted and the maintenance burden.
>>
>> So a few questions for those more familiar with Hadoop:
>>
>> 1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3?
>>
>> 2. If the answer to 1 is yes, are there known, major issues with backward
>> compatibility?
>>
>> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
>>
>> 4. (For Hadoop vendors) When did/will support for Hadoop 2.4 and below
>> stop? To what extent do you care about running Spark on older Hadoop
>> clusters?
>>
>>
>>
>> On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran <st...@hortonworks.com>
>> wrote:
>>
>>>
>>> On 20 Nov 2015, at 14:28, chester@alpinenow.com wrote:
>>>
>>> Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months
>>> away.
>>>
>>> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or
>>> later to leverage Spark 2.0 within a year. I think this is possible, as
>>> the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0
>>> already. Companies will have enough time to upgrade their clusters.
>>>
>>> +1 for me as well
>>>
>>> Chester
>>>
>>>
>>> Now, if you are looking that far ahead, the other big issue is "when to
>>> retire Java 7 support?"
>>>
>>> That's a tough decision for all projects. Hadoop 3.x will be Java 8
>>> only, but nobody has committed the patch to the trunk codebase to force a
>>> Java 8 build, and most of *today's* Hadoop clusters are Java 7. But as you
>>> can't even download a Java 7 JDK for the desktop from Oracle any more,
>>> 2016 is the time to look at the language support and decide what the
>>> baseline version is.
>>>
>>> Commentary from Twitter here; as they point out, it's not just the
>>> server farm that matters, it's all the apps that talk to it:
>>>
>>>
>>>
>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3CCAB7mwtE+kefcxsR6n46-ZTcS19ED7cWc9voBtR1jQEWDkye07g@mail.gmail.com%3E
>>>
>>> -Steve
>>>
>>
>>
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Sandy Ryza <sa...@cloudera.com>.
To answer your fourth question from Cloudera's perspective, we would never
support a customer running Spark 2.0 on a Hadoop version < 2.6.

-Sandy

On Fri, Nov 20, 2015 at 1:39 PM, Reynold Xin <rx...@databricks.com> wrote:

> OK I'm not exactly asking for a vote here :)
>
> I don't think we should look at it from only a maintenance point of view --
> because in that case the answer is clearly supporting as few versions as
> possible (or just rm -rf the spark source code and call it a day). It is a
> tradeoff between the number of users impacted and the maintenance burden.
>
> So a few questions for those more familiar with Hadoop:
>
> 1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3?
>
> 2. If the answer to 1 is yes, are there known, major issues with backward
> compatibility?
>
> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
>
> 4. (For Hadoop vendors) When did/will support for Hadoop 2.4 and below
> stop? To what extent do you care about running Spark on older Hadoop
> clusters?
>
>
>
> On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>>
>> On 20 Nov 2015, at 14:28, chester@alpinenow.com wrote:
>>
>> Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months
>> away.
>>
>> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or
>> later to leverage Spark 2.0 within a year. I think this is possible, as
>> the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0
>> already. Companies will have enough time to upgrade their clusters.
>>
>> +1 for me as well
>>
>> Chester
>>
>>
>> Now, if you are looking that far ahead, the other big issue is "when to
>> retire Java 7 support?"
>>
>> That's a tough decision for all projects. Hadoop 3.x will be Java 8 only,
>> but nobody has committed the patch to the trunk codebase to force a Java 8
>> build, and most of *today's* Hadoop clusters are Java 7. But as you can't even
>> download a Java 7 JDK for the desktop from Oracle any more, 2016 is the
>> time to look at the language support and decide what the baseline
>> version is.
>>
>> Commentary from Twitter here; as they point out, it's not just the server
>> farm that matters, it's all the apps that talk to it:
>>
>>
>>
>> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3CCAB7mwtE+kefcxsR6n46-ZTcS19ED7cWc9voBtR1jQEWDkye07g@mail.gmail.com%3E
>>
>> -Steve
>>
>
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Sean Owen <so...@cloudera.com>.
On Fri, Nov 20, 2015 at 10:39 PM, Reynold Xin <rx...@databricks.com> wrote:
> I don't think we should look at it from only a maintenance point of view --
> because in that case the answer is clearly supporting as few versions as
> possible (or just rm -rf the spark source code and call it a day). It is a
> tradeoff between the number of users impacted and the maintenance burden.

The upside to supporting only newer versions is less maintenance (no
small thing given how sprawling the build is), but also more ability
to use newer functionality. The downside is of course not letting
older Hadoop users use the latest Spark.


> 1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3?

If the question is really about HDFS, then I think the answer is
"yes". The big compatibility problem has been protobuf, but all of 2.2+
is on protobuf 2.5.


> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?

Same client/server question? This is where I'm not as clear. I think
the answer is "yes" to the extent you're using functionality that
existed in the older YARN. Of course, using some newer API against old
clusters doesn't work.


> 4. (For Hadoop vendors) When did/will support for Hadoop 2.4 and below stop?
> To what extent do you care about running Spark on older Hadoop clusters?

CDH 5.3 = Hadoop 2.6, FWIW, which was out about a year ago. Support
continues for a long time in the sense that CDH 5 will be supported
for years. However, Spark 2 would never be shipped / supported in CDH
5. So it's not an issue for Spark 2; Spark 2 will be "supported"
probably only against Hadoop 3, or at least something later in 2.x than 2.6.

The question here is really about whether Spark should specially
support, say, Spark 2 + CDH 5.0 or something. My experience so far is
that Spark has not really supported the older vendor versions it claims
to, and I'd rather not pretend it does. So this doesn't strike me as a
great reason either.

This is roughly why 2.6, as a safely recent version, seems like an OK
place to draw the line 6-8 months from now.



Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Steve Loughran <st...@hortonworks.com>.
> On 20 Nov 2015, at 21:39, Reynold Xin <rx...@databricks.com> wrote:
> 
> OK I'm not exactly asking for a vote here :)
> 
> I don't think we should look at it from only a maintenance point of view -- because in that case the answer is clearly supporting as few versions as possible (or just rm -rf the spark source code and call it a day). It is a tradeoff between the number of users impacted and the maintenance burden.
> 
> So a few questions for those more familiar with Hadoop:
> 
> 1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3?
> 

Yes, at the HDFS level.

There are some special cases where HDFS stops a 2.2-2.5 client from
talking to Hadoop 2.6:

- HDFS at-rest encryption needs a client that can decode it (2.6.x+)
- HDFS erasure coding will need a later version (2.8?)

If you turn SASL on in your datanodes, your DNs don't need to come up on a
port < 1024, but Hadoop < 2.6 clients stop being able to work with HDFS at
that point.
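
A minimal sketch of a client-side guard for that last case, assuming the
cluster's hdfs-site.xml is available as a resource (the object name and the
check itself are illustrative, not anything Spark ships):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.util.VersionInfo

    object SaslCompatCheck {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        conf.addResource("hdfs-site.xml")
        // dfs.data.transfer.protection is set when SASL data transfer is on.
        val sasl = conf.get("dfs.data.transfer.protection")
        val Array(major, minor) =
          VersionInfo.getVersion.split("\\.").take(2).map(_.toInt)
        if (sasl != null && (major < 2 || (major == 2 && minor < 6))) {
          System.err.println(s"WARNING: cluster requires SASL data transfer " +
            s"($sasl) but the Hadoop client is ${VersionInfo.getVersion}")
        }
      }
    }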



> 2. If the answer to 1 is yes, are there known, major issues with backward compatibility?
> 

Hadoop native libs, every time. Guava, Jackson, and protobuf can be managed with shading, but hadoop.{so,dll} is a real problem. A hadoop-2.6 JAR will use native methods in the Hadoop native lib which, if not loaded, will break the app. This is a pain, as nobody includes that native lib with their Java binaries; who could even predict which one they would have to include? As a consequence, I'd really advise against trying to run an app built with the 2.6 JARs inside a YARN cluster < 2.6. You can certainly talk to HDFS and the YARN services, but there's a risk a codepath will hit a native method that isn't there.
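
One cheap mitigation is to fail fast instead of hitting an
UnsatisfiedLinkError deep inside a task; a sketch (the object name is made
up):

    import org.apache.hadoop.util.NativeCodeLoader

    object NativeLibCheck {
      def main(args: Array[String]): Unit = {
        // True only if hadoop.{so,dll} was found on java.library.path and loaded.
        if (NativeCodeLoader.isNativeCodeLoaded) {
          println("libhadoop loaded; native codepaths are available")
        } else {
          // Pure-Java fallbacks exist for some codecs, but codepaths with no
          // fallback are exactly the ones that break as described above.
          System.err.println("libhadoop not found; avoid native-only codepaths")
        }
      }
    }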


It's trouble the other way too: even though we try not to break existing code by moving/renaming native methods, it can happen.

The last time someone did this in a big way, I was the first to find it in HADOOP-11064; the changes were reverted/altered, but there was no official declaration that compatibility at the JNI layer will be maintained. Apparently you can't guarantee it over JVM versions either.

We really need a lib versioning story, which is what HADOOP-11127 covers.

> 3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?
> 

I'd say no, with classpath and hadoop native being the failure points.

There's also feature completeness: Hadoop 2.6 was the first version with all the YARN-896 work for long-lived services.


> 4. (For Hadoop vendors) When did/will support for Hadoop 2.4 and below stop? To what extent do you care about running Spark on older Hadoop clusters?
> 
> 

I don't know, and I probably don't want to make any forward-looking statements anyway. But I don't even know how well supported 2.4 is today; 2.6 is the one that still gets bug fixes out from the ASF. I can see it lasting a while.


What essentially happens is that we provide bug fixes to the existing releases, but for anything new: upgrade.

Assuming that policy continues (disclaimer: personal opinions, etc), then any Spark 2.0 release would be rebuilt against all the JARs which the rest of that version of HDP would use, and that's the only version we'd recommend using.





Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Reynold Xin <rx...@databricks.com>.
OK I'm not exactly asking for a vote here :)

I don't think we should look at it from only a maintenance point of view --
because in that case the answer is clearly supporting as few versions as
possible (or just rm -rf the spark source code and call it a day). It is a
tradeoff between the number of users impacted and the maintenance burden.

So a few questions for those more familiar with Hadoop:

1. Can a Hadoop 2.6 client read Hadoop 2.4 / 2.3?

2. If the answer to 1 is yes, are there known, major issues with backward
compatibility?

3. Can Hadoop 2.6+ YARN work on older versions of YARN clusters?

4. (For Hadoop vendors) When did/will support for Hadoop 2.4 and below
stop? To what extent do you care about running Spark on older Hadoop
clusters?



On Fri, Nov 20, 2015 at 7:52 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>
> On 20 Nov 2015, at 14:28, chester@alpinenow.com wrote:
>
> Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months
> away.
>
> Customers will need to upgrade their Hadoop clusters to Apache 2.6 or
> later to leverage Spark 2.0 within a year. I think this is possible, as
> the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0
> already. Companies will have enough time to upgrade their clusters.
>
> +1 for me as well
>
> Chester
>
>
> Now, if you are looking that far ahead, the other big issue is "when to
> retire Java 7 support?"
>
> That's a tough decision for all projects. Hadoop 3.x will be Java 8 only,
> but nobody has committed the patch to the trunk codebase to force a Java 8
> build, and most of *today's* Hadoop clusters are Java 7. But as you can't even
> download a Java 7 JDK for the desktop from Oracle any more, 2016 is the
> time to look at the language support and decide what the baseline
> version is.
>
> Commentary from Twitter here; as they point out, it's not just the server
> farm that matters, it's all the apps that talk to it:
>
>
>
> http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3CCAB7mwtE+kefcxsR6n46-ZTcS19ED7cWc9voBtR1jQEWDkye07g@mail.gmail.com%3E
>
> -Steve
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Steve Loughran <st...@hortonworks.com>.
On 20 Nov 2015, at 14:28, chester@alpinenow.com wrote:

Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months away.

Customers will need to upgrade their Hadoop clusters to Apache 2.6 or later to leverage Spark 2.0 within a year. I think this is possible, as the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0 already. Companies will have enough time to upgrade their clusters.

+1 for me as well

Chester


Now, if you are looking that far ahead, the other big issue is "when to retire Java 7 support?"

That's a tough decision for all projects. Hadoop 3.x will be Java 8 only, but nobody has committed the patch to the trunk codebase to force a Java 8 build, and most of *today's* Hadoop clusters are Java 7. But as you can't even download a Java 7 JDK for the desktop from Oracle any more, 2016 is the time to look at the language support and decide what the baseline version is.
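
If the baseline did move to Java 8, the runtime guard itself is simple; a
minimal sketch, assuming a check at process startup (the object name is
made up):

    object JavaVersionGuard {
      def main(args: Array[String]): Unit = {
        // "java.specification.version" is "1.7" / "1.8" on those JDKs.
        val spec = System.getProperty("java.specification.version")
        val minor = spec.split("\\.").last.toInt
        require(minor >= 8, s"Java 8+ required, found specification $spec")
        println(s"Running on Java specification $spec")
      }
    }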

Commentary from Twitter here; as they point out, it's not just the server farm that matters, it's all the apps that talk to it:


http://mail-archives.apache.org/mod_mbox/hadoop-common-dev/201503.mbox/%3CCAB7mwtE+kefcxsR6n46-ZTcS19ED7cWc9voBtR1jQEWDkye07g@mail.gmail.com%3E

-Steve

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by ch...@alpinenow.com.
Assuming we have 1.6 and 1.7 releases, Spark 2.0 is about 9 months away.

Customers will need to upgrade their Hadoop clusters to Apache 2.6 or later to leverage Spark 2.0 within a year. I think this is possible, as the latest releases of CDH 5.x and HDP 2.x are both on Apache 2.6.0 already. Companies will have enough time to upgrade their clusters.

+1 for me as well

Chester




Sent from my iPad

> On Nov 19, 2015, at 2:14 PM, Reynold Xin <rx...@databricks.com> wrote:
> 
> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think everybody is for that.
> 
> https://issues.apache.org/jira/browse/SPARK-11807
> 
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to say, keep only Hadoop 2.6 and greater.
> 
> What are the community's thoughts on that?
> 

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Steve Loughran <st...@hortonworks.com>.
On 19 Nov 2015, at 22:14, Reynold Xin <rx...@databricks.com> wrote:

I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think everybody is for that.

https://issues.apache.org/jira/browse/SPARK-11807

Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to say, keep only Hadoop 2.6 and greater.

What are the community's thoughts on that?


+1

It's the common API under pretty much everything shipping: EMR, CDH & HDP. And there are no significant API changes between it and 2.7. [There are a couple of extra records in job submissions in 2.7 which you can get at with reflection, for the AM failure reset window and rolling log capture patterns.] It's also getting some ongoing maintenance (2.6.3 being planned for Dec).
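
The reflection involved is mundane; a minimal sketch of gating one such
setter (setAttemptFailuresValidityInterval, for the AM failure reset
window, is used as the example method here; treat the exact name and
version it appears in as an assumption):

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext

    object SubmissionContextShim {
      // Call the newer setter only when the client JARs have it, so the same
      // build runs against both 2.6 and 2.7 clusters.
      def setValidityIntervalIfAvailable(ctx: ApplicationSubmissionContext,
                                         intervalMs: Long): Unit = {
        try {
          val m = ctx.getClass.getMethod(
            "setAttemptFailuresValidityInterval", classOf[Long])
          m.invoke(ctx, java.lang.Long.valueOf(intervalMs))
        } catch {
          case _: NoSuchMethodException => // older YARN client: skip the feature
        }
      }
    }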

It's not perfect; if I were to list the trouble spots, to me they are: s3a isn't ready for use, and there's better logging and tracing in later versions. But those aren't at the API level.

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Ted Yu <yu...@gmail.com>.
Should a new job be set up under Spark-Master-Maven-with-YARN for Hadoop
2.6.x?

Cheers

On Thu, Nov 19, 2015 at 5:16 PM, 张志强(旺轩) <zz...@alibaba-inc.com> wrote:

> I agreed
> +1
>
> ------------------------------------------------------------------
> From: Reynold Xin<rx...@databricks.com>
> Date: 2015-11-20 06:14:44
> To: dev@spark.apache.org<de...@spark.apache.org>; Sean Owen<sr...@gmail.com>;
> Thomas Graves<tg...@apache.org>
> Subject: Dropping support for earlier Hadoop versions in Spark 2.0?
>
>
> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
> to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>
>
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by "张志强(旺轩)" <zz...@alibaba-inc.com>.
I agreed
+1

------------------------------------------------------------------
From: Reynold Xin<rx...@databricks.com>
Date: 2015-11-20 06:14:44
To: dev@spark.apache.org<de...@spark.apache.org>; Sean Owen<sr...@gmail.com>; Thomas Graves<tg...@apache.org>
Subject: Dropping support for earlier Hadoop versions in Spark 2.0?

I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I think everybody is for that.

https://issues.apache.org/jira/browse/SPARK-11807

Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is to say, keep only Hadoop 2.6 and greater.

What are the community's thoughts on that?



Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Henri Dubois-Ferriere <he...@gmail.com>.
+1

On 19 November 2015 at 14:14, Reynold Xin <rx...@databricks.com> wrote:

> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That is
> to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Saisai Shao <sa...@gmail.com>.
+1.

Hadoop 2.6 would be a good choice, with many features added (like support
for long-running services and label-based scheduling). Currently there's a
lot of reflection code to support multiple versions of YARN, so upgrading
to a newer version will really ease the pain :).
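
As a sketch of the payoff: with 2.6 as the floor, the reflective dance for,
say, node labels collapses into a direct call (assuming the hadoop-yarn-api
2.6 JARs; setNodeLabelExpression is the 2.6-era API for label-based
scheduling, and the object name is made up):

    import org.apache.hadoop.yarn.api.records.ResourceRequest

    object NodeLabelRequest {
      // Compiles only against 2.6+ JARs -- no getMethod/invoke dance needed
      // once older Hadoop versions are dropped.
      def withLabel(req: ResourceRequest, label: String): ResourceRequest = {
        req.setNodeLabelExpression(label)
        req
      }
    }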

Thanks
Saisai

On Fri, Nov 20, 2015 at 3:58 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> +1
>
> Regards
> JB
>
>
> On 11/19/2015 11:14 PM, Reynold Xin wrote:
>
>> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
>> think everybody is for that.
>>
>> https://issues.apache.org/jira/browse/SPARK-11807
>>
>> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That
>> is to say, keep only Hadoop 2.6 and greater.
>>
>> What are the community's thoughts on that?
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>

Re: Dropping support for earlier Hadoop versions in Spark 2.0?

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
+1

Regards
JB

On 11/19/2015 11:14 PM, Reynold Xin wrote:
> I proposed dropping support for Hadoop 1.x in the Spark 2.0 email, and I
> think everybody is for that.
>
> https://issues.apache.org/jira/browse/SPARK-11807
>
> Sean suggested also dropping support for Hadoop 2.2, 2.3, and 2.4. That
> is to say, keep only Hadoop 2.6 and greater.
>
> What are the community's thoughts on that?
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
