Posted to dev@spark.apache.org by Sean Owen <so...@cloudera.com> on 2018/02/08 18:50:23 UTC

Drop the Hadoop 2.6 profile?

At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly
old, and actually, not different from 2.7 with respect to Spark. That is, I
don't know if we are actually maintaining anything here but a separate
profile and 2x the number of test builds.

The cost is, by the same token, low. However I'm floating the idea of
removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
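
(For context, the profiles in question just select which Hadoop dependency
versions go into the build. A rough sketch of the two builds being compared,
following the Spark 2.x build docs; the exact default Hadoop versions vary by
release:

    # build against the Hadoop 2.6 line
    ./build/mvn -Pyarn -Phadoop-2.6 -DskipTests clean package

    # build against the Hadoop 2.7 line
    ./build/mvn -Pyarn -Phadoop-2.7 -DskipTests clean package

Dropping the 2.6 profile would leave only the second form.)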

Re: Drop the Hadoop 2.6 profile?

Posted by Koert Kuipers <ko...@tresata.com>.
oh nevermind, i am used to spark builds without hadoop included. but i
realize that if hadoop is included, it matters whether it's 2.6 or 2.7...

On Thu, Feb 8, 2018 at 5:06 PM, Koert Kuipers <ko...@tresata.com> wrote:

> wouldn't a hadoop 2.7 profile mean someone may introduce usage of some
> hadoop apis that don't exist in hadoop 2.6?
>
> why not keep 2.6 and ditch 2.7 given that hadoop 2.7 is backwards
> compatible with 2.6? what is the added value of having a 2.7 profile?
>
> On Thu, Feb 8, 2018 at 5:03 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> That would still work with a Hadoop-2.7-based profile, as there isn't
>> actually any code difference in Spark that treats the two versions
>> differently (nor, really, much different between 2.6 and 2.7 to begin
>> with). This practice of different profile builds was pretty unnecessary
>> after 2.2; it's mostly vestigial now.
>>
>> On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> CDH 5 is still based on hadoop 2.6
>>>
>>> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Mostly just shedding the extra build complexity, and builds. The
>>>> primary little annoyance is it's 2x the number of flaky build failures to
>>>> examine.
>>>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>>>> sure there is anything compelling.
>>>>
>>>> It's something that probably gains us virtually nothing now, but isn't
>>>> too painful either.
>>>> I think it will not make sense to distinguish them once any Hadoop
>>>> 3-related support comes into the picture, and maybe that will start soon;
>>>> there were some more pings on related JIRAs this week. You could view it as
>>>> early setup for that move.
>>>>
>>>>
>>>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com>
>>>> wrote:
>>>>
>>>>> Does it gain us anything to drop 2.6?
>>>>>
>>>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> >
>>>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
>>>>> fairly old, and actually, not different from 2.7 with respect to Spark.
>>>>> That is, I don't know if we are actually maintaining anything here but a
>>>>> separate profile and 2x the number of test builds.
>>>>> >
>>>>> > The cost is, by the same token, low. However I'm floating the idea
>>>>> of removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>>>>
>>>>
>>>
>

Re: Drop the Hadoop 2.6 profile?

Posted by Koert Kuipers <ko...@tresata.com>.
wouldn't a hadoop 2.7 profile mean someone may introduce usage of some
hadoop apis that don't exist in hadoop 2.6?

why not keep 2.6 and ditch 2.7 given that hadoop 2.7 is backwards
compatible with 2.6? what is the added value of having a 2.7 profile?

On Thu, Feb 8, 2018 at 5:03 PM, Sean Owen <so...@cloudera.com> wrote:

> That would still work with a Hadoop-2.7-based profile, as there isn't
> actually any code difference in Spark that treats the two versions
> differently (nor, really, much different between 2.6 and 2.7 to begin
> with). This practice of different profile builds was pretty unnecessary
> after 2.2; it's mostly vestigial now.
>
> On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:
>
>> CDH 5 is still based on hadoop 2.6
>>
>> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Mostly just shedding the extra build complexity, and builds. The primary
>>> little annoyance is it's 2x the number of flaky build failures to examine.
>>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>>> sure there is anything compelling.
>>>
>>> It's something that probably gains us virtually nothing now, but isn't
>>> too painful either.
>>> I think it will not make sense to distinguish them once any Hadoop
>>> 3-related support comes into the picture, and maybe that will start soon;
>>> there were some more pings on related JIRAs this week. You could view it as
>>> early setup for that move.
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> Does it gain us anything to drop 2.6?
>>>>
>>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> >
>>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
>>>> fairly old, and actually, not different from 2.7 with respect to Spark.
>>>> That is, I don't know if we are actually maintaining anything here but a
>>>> separate profile and 2x the number of test builds.
>>>> >
>>>> > The cost is, by the same token, low. However I'm floating the idea of
>>>> removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>>>
>>>
>>

Re: Drop the Hadoop 2.6 profile?

Posted by Steve Loughran <st...@hortonworks.com>.
I'd advocate 2.7 over 2.6, primarily due to Kerberos and JVM versions.


2.6 is not even qualified for Java 7, let alone Java 8: you've got no guarantees that things work on the min Java version Spark requires.

Kerberos is always the failure point here, as well as various libraries (jetty) which get used more on the server.

Except Guava, which gets everywhere and whose Java version policy is only slightly more stable than its binary compatibility.

If tests aren't seeing those problems, it may mean that Kerberos is being avoided, which is always nice to do, but it'll find you later.

See HADOOP-11287, HADOOP-12716 (2.8+ only, presumably backported to CDH and HDP), HADOOP-10786 (which is in 2.6.1).

-Steve



On 8 Feb 2018, at 22:30, Koert Kuipers <ko...@tresata.com> wrote:

wire compatibility is relevant if hadoop is included in spark build


for those of us that build spark without hadoop included, hadoop (binary) api compatibility matters. i wouldn't want to build against hadoop 2.7 and deploy on hadoop 2.6, but i am ok with the other way around. so to get compatibility with all the major distros and cloud providers, building against hadoop 2.6 is currently the way to go.


On Thu, Feb 8, 2018 at 5:09 PM, Marcelo Vanzin <va...@cloudera.com> wrote:
I think it would make sense to drop one of them, but not necessarily 2.6.

It kinda depends on what wire compatibility guarantees the Hadoop
libraries have; can a 2.6 client talk to 2.7 (pretty certain it can)?
Is the opposite safe (not sure)?

If the answer to the latter question is "no", then keeping 2.6 and
dropping 2.7 makes more sense. Those who really want a
Hadoop-version-specific package can override the needed versions in
the command line, or use the "without hadoop" package.

But in the context of trying to support 3.0 it makes sense to drop one
of them, at least from jenkins.


On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
> That would still work with a Hadoop-2.7-based profile, as there isn't
> actually any code difference in Spark that treats the two versions
> differently (nor, really, much different between 2.6 and 2.7 to begin with).
> This practice of different profile builds was pretty unnecessary after 2.2;
> it's mostly vestigial now.
>
> On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>> CDH 5 is still based on hadoop 2.6
>>
>> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Mostly just shedding the extra build complexity, and builds. The primary
>>> little annoyance is it's 2x the number of flaky build failures to examine.
>>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>>> sure there is anything compelling.
>>>
>>> It's something that probably gains us virtually nothing now, but isn't
>>> too painful either.
>>> I think it will not make sense to distinguish them once any Hadoop
>>> 3-related support comes into the picture, and maybe that will start soon;
>>> there were some more pings on related JIRAs this week. You could view it as
>>> early setup for that move.
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> Does it gain us anything to drop 2.6?
>>>>
>>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> >
>>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
>>>> > fairly old, and actually, not different from 2.7 with respect to Spark. That
>>>> > is, I don't know if we are actually maintaining anything here but a separate
>>>> > profile and 2x the number of test builds.
>>>> >
>>>> > The cost is, by the same token, low. However I'm floating the idea of
>>>> > removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>
>>
>



--
Marcelo



Re: Drop the Hadoop 2.6 profile?

Posted by Koert Kuipers <ko...@tresata.com>.
wire compatibility is relevant if hadoop is included in spark build


for those of us that build spark without hadoop included, hadoop (binary)
api compatibility matters. i wouldn't want to build against hadoop 2.7 and
deploy on hadoop 2.6, but i am ok with the other way around. so to get
compatibility with all the major distros and cloud providers, building
against hadoop 2.6 is currently the way to go.
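
(A sketch of the "without hadoop" deployment described above, per Spark's
hadoop-free build docs; the script and profile names are from the Spark 2.x
tree:

    # build a distribution that does not bundle Hadoop jars
    ./dev/make-distribution.sh --name hadoop-provided --tgz -Phadoop-provided -Pyarn

    # at deploy time, point Spark at the cluster's own Hadoop jars,
    # whichever 2.x line the cluster runs
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)

Since the Hadoop client classes come from the cluster, the same Spark build
runs against 2.6 or 2.7, which is why API compatibility rather than the build
profile is what matters in this setup.)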


On Thu, Feb 8, 2018 at 5:09 PM, Marcelo Vanzin <va...@cloudera.com> wrote:

> I think it would make sense to drop one of them, but not necessarily 2.6.
>
> It kinda depends on what wire compatibility guarantees the Hadoop
> libraries have; can a 2.6 client talk to 2.7 (pretty certain it can)?
> Is the opposite safe (not sure)?
>
> If the answer to the latter question is "no", then keeping 2.6 and
> dropping 2.7 makes more sense. Those who really want a
> Hadoop-version-specific package can override the needed versions in
> the command line, or use the "without hadoop" package.
>
> But in the context of trying to support 3.0 it makes sense to drop one
> of them, at least from jenkins.
>
>
> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
> > That would still work with a Hadoop-2.7-based profile, as there isn't
> > actually any code difference in Spark that treats the two versions
> > differently (nor, really, much different between 2.6 and 2.7 to begin
> with).
> > This practice of different profile builds was pretty unnecessary after
> 2.2;
> > it's mostly vestigial now.
> >
> > On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:
> >>
> >> CDH 5 is still based on hadoop 2.6
> >>
> >> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> Mostly just shedding the extra build complexity, and builds. The
> primary
> >>> little annoyance is it's 2x the number of flaky build failures to
> examine.
> >>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
> >>> sure there is anything compelling.
> >>>
> >>> It's something that probably gains us virtually nothing now, but isn't
> >>> too painful either.
> >>> I think it will not make sense to distinguish them once any Hadoop
> >>> 3-related support comes into the picture, and maybe that will start
> soon;
> >>> there were some more pings on related JIRAs this week. You could view
> it as
> >>> early setup for that move.
> >>>
> >>>
> >>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com>
> wrote:
> >>>>
> >>>> Does it gain us anything to drop 2.6?
> >>>>
> >>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
> >>>> >
> >>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
> >>>> > fairly old, and actually, not different from 2.7 with respect to
> Spark. That
> >>>> > is, I don't know if we are actually maintaining anything here but a
> separate
> >>>> > profile and 2x the number of test builds.
> >>>> >
> >>>> > The cost is, by the same token, low. However I'm floating the idea
> of
> >>>> > removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
> >>
> >>
> >
>
>
>
> --
> Marcelo
>

Re: Drop the Hadoop 2.6 profile?

Posted by Marcelo Vanzin <va...@cloudera.com>.
I think it would make sense to drop one of them, but not necessarily 2.6.

It kinda depends on what wire compatibility guarantees the Hadoop
libraries have; can a 2.6 client talk to 2.7 (pretty certain it can)?
Is the opposite safe (not sure)?

If the answer to the latter question is "no", then keeping 2.6 and
dropping 2.7 makes more sense. Those who really want a
Hadoop-version-specific package can override the needed versions in
the command line, or use the "without hadoop" package.

But in the context of trying to support 3.0 it makes sense to drop one
of them, at least from jenkins.
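
(The command-line override mentioned above is roughly this, using the
documented -Dhadoop.version build option; the exact release number is only
illustrative:

    ./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package

That keeps a single profile while still letting downstream packagers pin the
Hadoop release they ship against.)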


On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
> That would still work with a Hadoop-2.7-based profile, as there isn't
> actually any code difference in Spark that treats the two versions
> differently (nor, really, much different between 2.6 and 2.7 to begin with).
> This practice of different profile builds was pretty unnecessary after 2.2;
> it's mostly vestigial now.
>
> On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>> CDH 5 is still based on hadoop 2.6
>>
>> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Mostly just shedding the extra build complexity, and builds. The primary
>>> little annoyance is it's 2x the number of flaky build failures to examine.
>>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>>> sure there is anything compelling.
>>>
>>> It's something that probably gains us virtually nothing now, but isn't
>>> too painful either.
>>> I think it will not make sense to distinguish them once any Hadoop
>>> 3-related support comes into the picture, and maybe that will start soon;
>>> there were some more pings on related JIRAs this week. You could view it as
>>> early setup for that move.
>>>
>>>
>>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:
>>>>
>>>> Does it gain us anything to drop 2.6?
>>>>
>>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>>>> >
>>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
>>>> > fairly old, and actually, not different from 2.7 with respect to Spark. That
>>>> > is, I don't know if we are actually maintaining anything here but a separate
>>>> > profile and 2x the number of test builds.
>>>> >
>>>> > The cost is, by the same token, low. However I'm floating the idea of
>>>> > removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>
>>
>



-- 
Marcelo



Re: Drop the Hadoop 2.6 profile?

Posted by Sean Owen <so...@cloudera.com>.
That would still work with a Hadoop-2.7-based profile, as there isn't
actually any code difference in Spark that treats the two versions
differently (nor, really, much different between 2.6 and 2.7 to begin
with). This practice of different profile builds was pretty unnecessary
after 2.2; it's mostly vestigial now.

On Thu, Feb 8, 2018 at 3:57 PM Koert Kuipers <ko...@tresata.com> wrote:

> CDH 5 is still based on hadoop 2.6
>
> On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> Mostly just shedding the extra build complexity, and builds. The primary
>> little annoyance is it's 2x the number of flaky build failures to examine.
>> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
>> sure there is anything compelling.
>>
>> It's something that probably gains us virtually nothing now, but isn't
>> too painful either.
>> I think it will not make sense to distinguish them once any Hadoop
>> 3-related support comes into the picture, and maybe that will start soon;
>> there were some more pings on related JIRAs this week. You could view it as
>> early setup for that move.
>>
>>
>> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Does it gain us anything to drop 2.6?
>>>
>>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>>> >
>>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both
>>> fairly old, and actually, not different from 2.7 with respect to Spark.
>>> That is, I don't know if we are actually maintaining anything here but a
>>> separate profile and 2x the number of test builds.
>>> >
>>> > The cost is, by the same token, low. However I'm floating the idea of
>>> removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>>
>>
>

Re: Drop the Hadoop 2.6 profile?

Posted by Koert Kuipers <ko...@tresata.com>.
CDH 5 is still based on hadoop 2.6

On Thu, Feb 8, 2018 at 2:03 PM, Sean Owen <so...@cloudera.com> wrote:

> Mostly just shedding the extra build complexity, and builds. The primary
> little annoyance is it's 2x the number of flaky build failures to examine.
> I suppose it allows using a 2.7+-only feature, but outside of YARN, not
> sure there is anything compelling.
>
> It's something that probably gains us virtually nothing now, but isn't too
> painful either.
> I think it will not make sense to distinguish them once any Hadoop
> 3-related support comes into the picture, and maybe that will start soon;
> there were some more pings on related JIRAs this week. You could view it as
> early setup for that move.
>
>
> On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:
>
>> Does it gain us anything to drop 2.6?
>>
>> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
>> >
>> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly
>> old, and actually, not different from 2.7 with respect to Spark. That is, I
>> don't know if we are actually maintaining anything here but a separate
>> profile and 2x the number of test builds.
>> >
>> > The cost is, by the same token, low. However I'm floating the idea of
>> removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>>
>

Re: Drop the Hadoop 2.6 profile?

Posted by Sean Owen <so...@cloudera.com>.
Mostly just shedding the extra build complexity, and builds. The primary
little annoyance is it's 2x the number of flaky build failures to examine.
I suppose it allows using a 2.7+-only feature, but outside of YARN, not
sure there is anything compelling.

It's something that probably gains us virtually nothing now, but isn't too
painful either.
I think it will not make sense to distinguish them once any Hadoop
3-related support comes into the picture, and maybe that will start soon;
there were some more pings on related JIRAs this week. You could view it as
early setup for that move.


On Thu, Feb 8, 2018 at 12:57 PM Reynold Xin <rx...@databricks.com> wrote:

> Does it gain us anything to drop 2.6?
>
> > On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
> >
> > At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly
> old, and actually, not different from 2.7 with respect to Spark. That is, I
> don't know if we are actually maintaining anything here but a separate
> profile and 2x the number of test builds.
> >
> > The cost is, by the same token, low. However I'm floating the idea of
> removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
>

Re: Drop the Hadoop 2.6 profile?

Posted by Reynold Xin <rx...@databricks.com>.
Does it gain us anything to drop 2.6?

> On Feb 8, 2018, at 10:50 AM, Sean Owen <so...@cloudera.com> wrote:
> 
> At this point, with Hadoop 3 on deck, I think hadoop 2.6 is both fairly old, and actually, not different from 2.7 with respect to Spark. That is, I don't know if we are actually maintaining anything here but a separate profile and 2x the number of test builds.
> 
> The cost is, by the same token, low. However I'm floating the idea of removing the 2.6 profile and just requiring 2.7+ as of Spark 2.4?
