Posted to user@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/01/14 07:29:09 UTC

[discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

We've dropped Hadoop 1.x support in Spark 2.0.

There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal
Hadoop version we support would be Hadoop 2.4. The main advantage is that
we'd then be able to focus our Jenkins resources (and the associated
maintenance of Jenkins) on creating builds for Hadoop 2.6/2.7. It is my
understanding that all Hadoop vendors have moved away from 2.2/2.3, but
there might be some users who are on these older versions.
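
For anyone checking whether this would affect them, here is a small sketch
(the object name is made up) that prints the Hadoop version a Spark
deployment is linked against, using Hadoop's own VersionInfo class:

    import org.apache.hadoop.util.VersionInfo

    object HadoopVersionCheck {
      def main(args: Array[String]): Unit = {
        // Prints the Hadoop version on the classpath, e.g. "2.6.0", so you
        // can tell whether a 2.4+ (or 2.6+) minimum would affect you.
        println(s"Hadoop version: ${VersionInfo.getVersion}")
      }
    }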

What do you think about this idea?

Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

Posted by Reynold Xin <rx...@databricks.com>.
Thanks for chiming in. Note that an organization's agility in Spark
upgrades can be very different from its agility in Hadoop upgrades.

For many orgs, Hadoop is responsible for cluster resource scheduling (YARN)
and data storage (HDFS). These two are notoriously difficult to upgrade: it
is all or nothing for a cluster. (You can't have a subset of the nodes
running Hadoop 2.2 and the other subset running Hadoop 2.6.) For Spark, it
is a very different story. It is pretty easy to run multiple different
versions of Spark in different applications, even though they are all
running on a single cluster.
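
As a rough sketch of what that looks like in practice (the Spark home
directories, jar paths, and class names below are made up), the
SparkLauncher API can submit applications from two different Spark
distributions to the same YARN cluster:

    import org.apache.spark.launcher.SparkLauncher

    object MixedSparkVersions {
      def main(args: Array[String]): Unit = {
        // Existing application, kept on the Spark 1.x distribution it was
        // written against. SparkLauncher shells out to the spark-submit
        // found under the given Spark home.
        val legacy = new SparkLauncher()
          .setSparkHome("/opt/spark-1.6.0")
          .setAppResource("/jobs/legacy-etl.jar")
          .setMainClass("com.example.LegacyEtl")
          .setMaster("yarn")
          .setDeployMode("cluster")
          .launch()

        // New application, submitted from a Spark 2.x distribution to the
        // same YARN cluster; the cluster's Hadoop version is unchanged.
        val fresh = new SparkLauncher()
          .setSparkHome("/opt/spark-2.0.0")
          .setAppResource("/jobs/new-pipeline.jar")
          .setMainClass("com.example.NewPipeline")
          .setMaster("yarn")
          .setDeployMode("cluster")
          .launch()

        legacy.waitFor()
        fresh.waitFor()
      }
    }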

As a result, you might see a lot of orgs that run really old Hadoop
versions and yet are willing to upgrade to Spark 2.x.





On Thu, Jan 14, 2016 at 11:26 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>
> > On 14 Jan 2016, at 09:28, Steve Loughran <st...@hortonworks.com> wrote:
> >>
> >
> > 2.6.x is still getting active releases, likely through 2016. It'll be the
> only Hadoop version where problems Spark encounters will get fixed.
>
> Correction: minimum Hadoop version
>
> Any problem reported against older versions will probably get a message
> saying "upgrade"
>
> >
> > It's also the latest iteration of interesting API features, especially in
> YARN: timeline server, registry, various other things.
> >
> > And it has s3a, which, for anyone using S3 for storage, is the only S3
> filesystem binding I'd recommend. Hadoop 2.4 has only s3n, and a broken one
> at that (HADOOP-10589).
> >
> > I believe 2.6 supports recent Guava versions, even if it is frozen on
> 11.0 to avoid surprising people (i.e. all deprecated/removed classes should
> have been stripped).
> >
> > Finally: it's the only version of Hadoop that works on Java 7 and has
> patches to support Java 8 + Kerberos (in fact, Java 7u80+ and Kerberos).
> >
> > For the reasons of JVMs and Guava alone, I'd abandon Hadoop < 2.6. Those
> versions won't work on secure Java 7 clusters or with recent Guava versions,
> and have lots of uncorrected issues.
> >
> > Oh, and did I mention the test matrix? The later the version of Hadoop
> you use, the fewer versions there are to test against.
> >
> >> My general position is that backwards-compatibility and supporting
> >> older platforms needs to be a low priority in a major release; it's a
> >> decision about what to support for users in the next couple years, not
> >> the preceding couple years. Users on older technologies simply stay on
> >> the older Spark until ready to update; they are in no sense suddenly
> >> left behind otherwise.
> >
> >
> > If they are running older versions of Hadoop, they generally have stable
> apps which they don't bother upgrading. New clusters => new versions => new
> apps.
> >
> >
> >
> [garbled mail-client encoding artifact]
>
> I have no idea what this is or why it made it to the tail of my email.
> Maybe outlook has changed its signature for me.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

Posted by Steve Loughran <st...@hortonworks.com>.
> On 14 Jan 2016, at 09:28, Steve Loughran <st...@hortonworks.com> wrote:
>> 
> 
> 2.6.x is still getting active releases, likely through 2016. It'll be the only Hadoop version where problems Spark encounters will get fixed.

Correction: minimum Hadoop version

Any problem reported against older versions will probably get a message saying "upgrade"

> 
> It's also the latest iteration of interesting API features, especially in YARN: timeline server, registry, various other things.
> 
> And it has s3a, which, for anyone using S3 for storage, is the only S3 filesystem binding I'd recommend. Hadoop 2.4 has only s3n, and a broken one at that (HADOOP-10589).
> 
> I believe 2.6 supports recent Guava versions, even if it is frozen on 11.0 to avoid surprising people (i.e. all deprecated/removed classes should have been stripped).
> 
> Finally: it's the only version of Hadoop that works on Java 7 and has patches to support Java 8 + Kerberos (in fact, Java 7u80+ and Kerberos).
> 
> For the reasons of JVMs and Guava alone, I'd abandon Hadoop < 2.6. Those versions won't work on secure Java 7 clusters or with recent Guava versions, and have lots of uncorrected issues.
> 
> Oh, and did I mention the test matrix? The later the version of Hadoop you use, the fewer versions there are to test against.
> 
>> My general position is that backwards-compatibility and supporting
>> older platforms needs to be a low priority in a major release; it's a
>> decision about what to support for users in the next couple years, not
>> the preceding couple years. Users on older technologies simply stay on
>> the older Spark until ready to update; they are in no sense suddenly
>> left behind otherwise.
> 
> 
> If they are running older versions of Hadoop, they generally have stable apps which they don't bother upgrading. New clusters => new versions => new apps.
> 
> 
> [garbled mail-client encoding artifact]

I have no idea what this is or why it made it to the tail of my email. Maybe outlook has changed its signature for me.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

Posted by Steve Loughran <st...@hortonworks.com>.
> On 14 Jan 2016, at 02:17, Sean Owen <so...@cloudera.com> wrote:
> 
> I personally support this. I had suggested drawing the line at Hadoop
> 2.6, but that's minor. More info:
> 
> Hadoop 2.7: April 2015
> Hadoop 2.6: Nov 2014
> Hadoop 2.5: Aug 2014
> Hadoop 2.4: April 2014
> Hadoop 2.3: Feb 2014
> Hadoop 2.2: Oct 2013
> 
> CDH 5.0/5.1 = Hadoop 2.3 + backports
> CDH 5.2/5.3 = Hadoop 2.5 + backports
> CDH 5.4+ = Hadoop 2.6 + chunks of 2.7 + backports.
> 
> I can only imagine that CDH6 this year will be based on something
> later still like 2.8 (no idea about the 3.0 schedule).

Hadoop 2.8 comes out in ~1-2 months. I've already been building & testing Spark against it; no major issues.

> In the sense
> that 5.2 was released about a year and a half ago, yes, this vendor moved
> on from 2.3 a while ago. These releases will also never contain
> a different minor Spark release. For example 5.7 will have Spark 1.6,
> I believe, and not 2.0.
> 
> Here, I listed some additional things we could clean up in Spark if
> Hadoop 2.6 was assumed. By itself, not a lot:
> https://github.com/apache/spark/pull/10446#issuecomment-167971026
> 


> Yes, we also get less Jenkins complexity. Mostly, the jar-hell that's
> biting now gets a little more feasible to fix. And we get Hadoop fixes
> as well as new APIs, which helps mostly for YARN.
> 

2.6.x is still getting active releases, likely through 2016. It'll be the only Hadoop version where problems Spark encounters will get fixed.

It's also the latest iteration of interesting API features, especially in YARN: timeline server, registry, various other things.

And it has s3a, which, for anyone using S3 for storage, is the only S3 filesystem binding I'd recommend. Hadoop 2.4 has only s3n, and a broken one at that (HADOOP-10589).
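
(A minimal sketch of reading through s3a, assuming Hadoop 2.6+ with the
hadoop-aws module on the classpath; the bucket, paths, and environment
variable names are made up, and credentials would normally come from IAM
roles or a credential provider chain rather than being set by hand.)

    import org.apache.spark.{SparkConf, SparkContext}

    object S3AExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))

        // fs.s3a.* keys are the Hadoop 2.6+ s3a configuration properties.
        sc.hadoopConfiguration.set("fs.s3a.access.key",
          sys.env("AWS_ACCESS_KEY_ID"))
        sc.hadoopConfiguration.set("fs.s3a.secret.key",
          sys.env("AWS_SECRET_ACCESS_KEY"))

        // Read directly from S3 through the s3a:// filesystem binding.
        val lines = sc.textFile("s3a://my-bucket/logs/*.gz")
        println(s"line count: ${lines.count()}")

        sc.stop()
      }
    }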

I believe 2.6 supports recent Guava versions, even if it is frozen on 11.0 to avoid surprising people (i.e. all deprecated/removed classes should have been stripped).

Finally: it's the only version of Hadoop that works on Java 7 and has patches to support Java 8 + Kerberos (in fact, Java 7u80+ and Kerberos).

For the reasons of JVMs and Guava alone, I'd abandon Hadoop < 2.6. Those versions won't work on secure Java 7 clusters or with recent Guava versions, and have lots of uncorrected issues.

Oh, and did I mention the test matrix? The later the version of Hadoop you use, the fewer versions there are to test against.

> My general position is that backwards-compatibility and supporting
> older platforms needs to be a low priority in a major release; it's a
> decision about what to support for users in the next couple years, not
> the preceding couple years. Users on older technologies simply stay on
> the older Spark until ready to update; they are in no sense suddenly
> left behind otherwise.


If they are running older versions of Hadoop, they generally have stable apps which they don't bother upgrading. New clusters => new versions => new apps.



Re: [discuss] dropping Hadoop 2.2 and 2.3 support in Spark 2.0?

Posted by Sean Owen <so...@cloudera.com>.
I personally support this. I had suggested drawing the line at Hadoop
2.6, but that's minor. More info:

Hadoop 2.7: April 2015
Hadoop 2.6: Nov 2014
Hadoop 2.5: Aug 2014
Hadoop 2.4: April 2014
Hadoop 2.3: Feb 2014
Hadoop 2.2: Oct 2013

CDH 5.0/5.1 = Hadoop 2.3 + backports
CDH 5.2/5.3 = Hadoop 2.5 + backports
CDH 5.4+ = Hadoop 2.6 + chunks of 2.7 + backports.

I can only imagine that CDH6 this year will be based on something
later still like 2.8 (no idea about the 3.0 schedule). In the sense
that 5.2 was released about a year and a half ago, yes, this vendor moved
on from 2.3 a while ago. These releases will also never contain
a different minor Spark release. For example 5.7 will have Spark 1.6,
I believe, and not 2.0.

Here, I listed some additional things we could clean up in Spark if
Hadoop 2.6 was assumed. By itself, not a lot:
https://github.com/apache/spark/pull/10446#issuecomment-167971026

Yes, we also get less Jenkins complexity. Mostly, the jar-hell that's
biting now gets a little more feasible to fix. And we get Hadoop fixes
as well as new APIs, which helps mostly for YARN.

My general position is that backwards-compatibility and supporting
older platforms needs to be a low priority in a major release; it's a
decision about what to support for users in the next couple years, not
the preceding couple years. Users on older technologies simply stay on
the older Spark until ready to update; they are in no sense suddenly
left behind otherwise.

On Thu, Jan 14, 2016 at 6:29 AM, Reynold Xin <rx...@databricks.com> wrote:
> We've dropped Hadoop 1.x support in Spark 2.0.
>
> There is also a proposal to drop Hadoop 2.2 and 2.3, i.e. the minimal Hadoop
> version we support would be Hadoop 2.4. The main advantage is that we'd then
> be able to focus our Jenkins resources (and the associated maintenance of
> Jenkins) on creating builds for Hadoop 2.6/2.7. It is my understanding that
> all Hadoop vendors have moved away from 2.2/2.3, but there might be some
> users who are on these older versions.
>
> What do you think about this idea?
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

