Posted to dev@spark.apache.org by Sean Owen <so...@cloudera.com> on 2015/03/08 22:56:31 UTC

Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
Maven artifacts.

Patrick I see you just commented on SPARK-5134 and will follow up
there. Sounds like this may accidentally not be a problem.

On binary tarball releases, I wonder if anyone has an opinion on my
view that these shouldn't be distributed for specific Hadoop
*distributions* to begin with. (I won't repeat the argument here yet.)
That resolves this n x m explosion too.

Vendors already provide their own distribution, yes, that's their job.


On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
> Distributions X ...
>
> Maybe one option is to have a minimum basic set (which I know is what we
> are discussing) and move the rest to spark-packages.org. There the vendors
> can add the latest downloads - for example, when 1.4 is released, HDP can
> build an HDP Spark 1.4 bundle.
>
> Cheers
> <k/>
>
> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>
>> We probably want to revisit the way we do binaries in general for
>> 1.4+. IMO, something worth forking a separate thread for.
>>
>> I've been hesitating to add new binaries because people
>> (understandably) complain if you ever stop packaging older ones, but
>> on the other hand the ASF has complained that we have too many
>> binaries already and that we need to pare it down because of the large
>> volume of files. Doubling the number of binaries we produce for Scala
>> 2.11 seemed like it would be too much.
>>
>> One potential solution is to package "Hadoop provided"
>> binaries and encourage users to use these by simply setting
>> HADOOP_HOME, or to provide instructions for specific distros. I've heard
>> that our existing packages don't work well on HDP for instance, since
>> there are some configuration quirks that differ from the upstream
>> Hadoop.
>>
>> If we cut down on the cross building for Hadoop versions, then it is
>> more tenable to cross build for Scala versions without exploding the
>> number of binaries.
>>
>> - Patrick
>>
>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>> > Yeah, interesting question of what is the better default for the
>> > single set of artifacts published to Maven. I think there's an
>> > argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>> > and cons discussed more at
>> >
>> > https://issues.apache.org/jira/browse/SPARK-5134
>> > https://github.com/apache/spark/pull/3917
>> >
>> > On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>> > wrote:
>> >> +1
>> >>
>> >> Tested it on Mac OS X.
>> >>
>> >> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>> >> 1 without Hive, which is kind of weird because people will more likely want
>> >> Hadoop 2 with Hive. So it would be good to publish a build for that
>> >> configuration instead. We can do it if we do a new RC, or it might be that
>> >> binary builds may not need to be voted on (I forgot the details there).
>> >>
>> >> Matei
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> For additional commands, e-mail: dev-help@spark.apache.org
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Patrick Wendell <pw...@gmail.com>.
I think it's important to separate the goals from the implementation.
I agree with Matei on the goal - I think the goal needs to be to allow
people to download Apache Spark and use it with CDH, HDP, MapR,
whatever... This is the whole reason why HDFS and YARN have stable
APIs, so that other projects can build on them in a way that works
across multiple versions. I wouldn't want to force users to upgrade
only according to some vendor's timetable; from the ASF perspective,
that doesn't seem like a good thing for the project. If users want to
get packages from Bigtop, or the vendors, that's totally fine too.

My point earlier was - I am not sure we are actually accomplishing
that goal now, because I've heard that in some cases our "Hadoop 2.X"
packages actually don't work on certain distributions, even those that
are based on that Hadoop version. So one solution is to move towards
"bring your own Hadoop" binaries and have users just set HADOOP_HOME
and maybe document any vendor-specific configs that need to be set.
That also happens to solve the "too many binaries" problem, but only
incidentally.

- Patrick
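
To make the "bring your own Hadoop" idea concrete, here is a minimal sketch
of how a user might point a Hadoop-provided Spark tarball at a cluster's
existing Hadoop installation. The paths are illustrative, and the
SPARK_DIST_CLASSPATH variable comes from Spark's "Hadoop free" build
documentation; this is one possible shape of the proposal, not something
decided in this thread.

    # Sketch: run a "Hadoop provided" Spark build against the cluster's own
    # Hadoop client libraries and configuration. Paths are illustrative.
    export HADOOP_HOME=/usr/lib/hadoop            # wherever the distro installs Hadoop
    export HADOOP_CONF_DIR=/etc/hadoop/conf       # distro-specific configuration
    # Put the cluster's Hadoop jars on Spark's classpath instead of bundling them.
    export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME/bin/hadoop" classpath)
    # Spark then links against whatever Hadoop the cluster provides.
    ./bin/spark-shell --master yarn-client

Whether setting HADOOP_HOME alone is enough, or whether distro-specific
quirks still need documenting, is exactly the open question raised above.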

On Sun, Mar 8, 2015 at 4:07 PM, Matei Zaharia <ma...@gmail.com> wrote:
> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
>
> Matei
>
>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>> Maven artifacts.
>>
>> Patrick I see you just commented on SPARK-5134 and will follow up
>> there. Sounds like this may accidentally not be a problem.
>>
>> On binary tarball releases, I wonder if anyone has an opinion on my
>> opinion that these shouldn't be distributed for specific Hadoop
>> *distributions* to begin with. (Won't repeat the argument here yet.)
>> That resolves this n x m explosion too.
>>
>> Vendors already provide their own distribution, yes, that's their job.
>>
>>
>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>> Distributions X ...
>>>
>>> May be one option is to have a minimum basic set (which I know is what we
>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>> build a release of HDP Spark 1.4 bundle.
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>>>
>>>> We probably want to revisit the way we do binaries in general for
>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>
>>>> I've been hesitating to add new binaries because people
>>>> (understandably) complain if you ever stop packaging older ones, but
>>>> on the other hand the ASF has complained that we have too many
>>>> binaries already and that we need to pare it down because of the large
>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>> 2.11 seemed like it would be too much.
>>>>
>>>> One solution potentially is to actually package "Hadoop provided"
>>>> binaries and encourage users to use these by simply setting
>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>> that our existing packages don't work well on HDP for instance, since
>>>> there are some configuration quirks that differ from the upstream
>>>> Hadoop.
>>>>
>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>> more tenable to cross build for Scala versions without exploding the
>>>> number of binaries.
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>> Yeah, interesting question of what is the better default for the
>>>>> single set of artifacts published to Maven. I think there's an
>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>> and cons discussed more at
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>> https://github.com/apache/spark/pull/3917
>>>>>
>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>>> wrote:
>>>>>> +1
>>>>>>
>>>>>> Tested it on Mac OS X.
>>>>>>
>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>>>
>>>>>> Matei
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Andrew Ash <an...@andrewash.com>.
Does the Apache project team have any ability to measure download counts of
the various releases?  That data could be useful when it comes time to
sunset vendor-specific releases, like CDH4 for example.

On Mon, Mar 9, 2015 at 5:34 AM, Mridul Muralidharan <mr...@gmail.com>
wrote:

> In ideal situation, +1 on removing all vendor specific builds and
> making just hadoop version specific - that is what we should depend on
> anyway.
> Though I hope Sean is correct in assuming that vendor specific builds
> for hadoop 2.4 are just that; and not 2.4- or 2.4+ which cause
> incompatibilities for us or our users !
>
> Regards,
> Mridul
>
>
> On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <so...@cloudera.com> wrote:
> > Yes, you should always find working bits at Apache no matter what --
> > though 'no matter what' really means 'as long as you use Hadoop distro
> > compatible with upstream Hadoop'. Even distros have a strong interest
> > in that, since the market, the 'pie', is made large by this kind of
> > freedom at the core.
> >
> > If so, then no vendor-specific builds are needed, only some
> > Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
> > good (although I'm not yet clear if there's something about 2.5 or 2.6
> > that needs a different build.)
> >
> > I take it that we already believe that, say, the "Hadoop 2.4" build
> > works with CDH5, so no CDH5-specific build is provided by Spark.
> >
> > If a distro doesn't work with stock Spark, then it's either something
> > Spark should fix (e.g. use of a private YARN API or something), or
> > it's something the distro should really fix because it's incompatible.
> >
> > Could we maybe rename the "CDH4" build then, as it doesn't really work
> > with all CDH4, to be a "Hadoop 2.0.x build"? That's been floated
> > before. And can we remove the MapR builds -- or else can someone
> > explain why these exist separately from a Hadoop 2.3 build? I hope it
> > is not *because* they are somehow non-standard. And shall we first run
> > down why Spark doesn't fully work on HDP and see if it's something
> > that Spark or HDP needs to tweak, rather than contemplate another
> > binary? or, if so, can it simply be called a "Hadoop 2.7 + YARN
> > whatever" build and not made specific to a vendor, even if the project
> > has to field another tarball combo for a vendor?
> >
> > Maybe we are saying almost the same thing.
> >
> >
> > On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <ma...@gmail.com>
> wrote:
> >> Yeah, my concern is that people should get Apache Spark from *Apache*,
> not from a vendor. It helps everyone use the latest features no matter
> where they are. In the Hadoop distro case, Hadoop made all this effort to
> have standard APIs (e.g. YARN), so it should be easy. But it is a problem
> if we're not packaging for the newest versions of some distros; I think we
> just fell behind at Hadoop 2.4.
> >>
> >> Matei
> >>
> >>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>
> >>> Yeah it's not much overhead, but here's an example of where it causes
> >>> a little issue.
> >>>
> >>> I like that reasoning. However, the released builds don't track the
> >>> later versions of Hadoop that vendors would be distributing -- there's
> >>> no Hadoop 2.6 build for example. CDH4 is here, but not the
> >>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
> >>> actually work with many CDH4 versions.
> >>>
> >>> I agree with the goal of maximizing the reach of Spark, but I don't
> >>> know how much these builds advance that goal.
> >>>
> >>> Anyone can roll-their-own exactly-right build, and the docs and build
> >>> have been set up to make that as simple as can be expected. So these
> >>> aren't *required* to let me use latest Spark on distribution X.
> >>>
> >>> I had thought these existed to sorta support 'legacy' distributions,
> >>> like CDH4, and that build was justified as a
> >>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
> >>> the MapR profiles are for.
> >>>
> >>> I think it's too much work to correctly, in parallel, maintain any
> >>> customizations necessary for any major distro, and it might be best to
> >>> do not at all than to do it incompletely. You could say it's also an
> >>> enabler for distros to vary in ways that require special
> >>> customization.
> >>>
> >>> Maybe there's a concern that, if lots of people consume Spark on
> >>> Hadoop, and most people consume Hadoop through distros, and distros
> >>> alone manage Spark distributions, then you de facto 'have to' go
> >>> through a distro instead of get bits from Spark? Different
> >>> conversation but I think this sort of effect does not end up being a
> >>> negative.
> >>>
> >>> Well anyway, I like the idea of seeing how far Hadoop-provided
> >>> releases can help. It might kill several birds with one stone.
> >>>
> >>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <
> matei.zaharia@gmail.com> wrote:
> >>>> Our goal is to let people use the latest Apache release even if
> vendors fall behind or don't want to package everything, so that's why we
> put out releases for vendors' versions. It's fairly low overhead.
> >>>>
> >>>> Matei
> >>>>
> >>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
> >>>>>
> >>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11
> tarball
> >>>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
> >>>>> Maven artifacts.
> >>>>>
> >>>>> Patrick I see you just commented on SPARK-5134 and will follow up
> >>>>> there. Sounds like this may accidentally not be a problem.
> >>>>>
> >>>>> On binary tarball releases, I wonder if anyone has an opinion on my
> >>>>> opinion that these shouldn't be distributed for specific Hadoop
> >>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
> >>>>> That resolves this n x m explosion too.
> >>>>>
> >>>>> Vendors already provide their own distribution, yes, that's their
> job.
> >>>>>
> >>>>>
> >>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com>
> wrote:
> >>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X
> Hadoop
> >>>>>> Distributions X ...
> >>>>>>
> >>>>>> May be one option is to have a minimum basic set (which I know is
> what we
> >>>>>> are discussing) and move the rest to spark-packages.org. There the
> vendors
> >>>>>> can add the latest downloads - for example when 1.4 is released,
> HDP can
> >>>>>> build a release of HDP Spark 1.4 bundle.
> >>>>>>
> >>>>>> Cheers
> >>>>>> <k/>
> >>>>>>
> >>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>> We probably want to revisit the way we do binaries in general for
> >>>>>>> 1.4+. IMO, something worth forking a separate thread for.
> >>>>>>>
> >>>>>>> I've been hesitating to add new binaries because people
> >>>>>>> (understandably) complain if you ever stop packaging older ones,
> but
> >>>>>>> on the other hand the ASF has complained that we have too many
> >>>>>>> binaries already and that we need to pare it down because of the
> large
> >>>>>>> volume of files. Doubling the number of binaries we produce for
> Scala
> >>>>>>> 2.11 seemed like it would be too much.
> >>>>>>>
> >>>>>>> One solution potentially is to actually package "Hadoop provided"
> >>>>>>> binaries and encourage users to use these by simply setting
> >>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
> >>>>>>> that our existing packages don't work well on HDP for instance,
> since
> >>>>>>> there are some configuration quirks that differ from the upstream
> >>>>>>> Hadoop.
> >>>>>>>
> >>>>>>> If we cut down on the cross building for Hadoop versions, then it
> is
> >>>>>>> more tenable to cross build for Scala versions without exploding
> the
> >>>>>>> number of binaries.
> >>>>>>>
> >>>>>>> - Patrick
> >>>>>>>
> >>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com>
> wrote:
> >>>>>>>> Yeah, interesting question of what is the better default for the
> >>>>>>>> single set of artifacts published to Maven. I think there's an
> >>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too.
> Pros
> >>>>>>>> and cons discussed more at
> >>>>>>>>
> >>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
> >>>>>>>> https://github.com/apache/spark/pull/3917
> >>>>>>>>
> >>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <
> matei.zaharia@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>> +1
> >>>>>>>>>
> >>>>>>>>> Tested it on Mac OS X.
> >>>>>>>>>
> >>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using
> Hadoop
> >>>>>>>>> 1 without Hive, which is kind of weird because people will more
> likely want
> >>>>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for
> that
> >>>>>>>>> configuration instead. We can do it if we do a new RC, or it
> might be that
> >>>>>>>>> binary builds may not need to be voted on (I forgot the details
> there).
> >>>>>>>>>
> >>>>>>>>> Matei
> >>>>>>>
> >>>>>>>
> ---------------------------------------------------------------------
> >>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> >>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
> >>>>>>>
> >>>>>>
> >>>>
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> > For additional commands, e-mail: dev-help@spark.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>

Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Mridul Muralidharan <mr...@gmail.com>.
In an ideal situation, +1 on removing all vendor-specific builds and
making them just Hadoop-version-specific - that is what we should depend
on anyway.
Though I hope Sean is correct in assuming that vendor-specific builds
for Hadoop 2.4 are just that, and not 2.4- or 2.4+, which would cause
incompatibilities for us or our users!

Regards,
Mridul


On Mon, Mar 9, 2015 at 2:50 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, you should always find working bits at Apache no matter what --
> though 'no matter what' really means 'as long as you use Hadoop distro
> compatible with upstream Hadoop'. Even distros have a strong interest
> in that, since the market, the 'pie', is made large by this kind of
> freedom at the core.
>
> If so, then no vendor-specific builds are needed, only some
> Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
> good (although I'm not yet clear if there's something about 2.5 or 2.6
> that needs a different build.)
>
> I take it that we already believe that, say, the "Hadoop 2.4" build
> works with CDH5, so no CDH5-specific build is provided by Spark.
>
> If a distro doesn't work with stock Spark, then it's either something
> Spark should fix (e.g. use of a private YARN API or something), or
> it's something the distro should really fix because it's incompatible.
>
> Could we maybe rename the "CDH4" build then, as it doesn't really work
> with all CDH4, to be a "Hadoop 2.0.x build"? That's been floated
> before. And can we remove the MapR builds -- or else can someone
> explain why these exist separately from a Hadoop 2.3 build? I hope it
> is not *because* they are somehow non-standard. And shall we first run
> down why Spark doesn't fully work on HDP and see if it's something
> that Spark or HDP needs to tweak, rather than contemplate another
> binary? or, if so, can it simply be called a "Hadoop 2.7 + YARN
> whatever" build and not made specific to a vendor, even if the project
> has to field another tarball combo for a vendor?
>
> Maybe we are saying almost the same thing.
>
>
> On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <ma...@gmail.com> wrote:
>> Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4.
>>
>> Matei
>>
>>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>> Yeah it's not much overhead, but here's an example of where it causes
>>> a little issue.
>>>
>>> I like that reasoning. However, the released builds don't track the
>>> later versions of Hadoop that vendors would be distributing -- there's
>>> no Hadoop 2.6 build for example. CDH4 is here, but not the
>>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
>>> actually work with many CDH4 versions.
>>>
>>> I agree with the goal of maximizing the reach of Spark, but I don't
>>> know how much these builds advance that goal.
>>>
>>> Anyone can roll-their-own exactly-right build, and the docs and build
>>> have been set up to make that as simple as can be expected. So these
>>> aren't *required* to let me use latest Spark on distribution X.
>>>
>>> I had thought these existed to sorta support 'legacy' distributions,
>>> like CDH4, and that build was justified as a
>>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
>>> the MapR profiles are for.
>>>
>>> I think it's too much work to correctly, in parallel, maintain any
>>> customizations necessary for any major distro, and it might be best to
>>> do not at all than to do it incompletely. You could say it's also an
>>> enabler for distros to vary in ways that require special
>>> customization.
>>>
>>> Maybe there's a concern that, if lots of people consume Spark on
>>> Hadoop, and most people consume Hadoop through distros, and distros
>>> alone manage Spark distributions, then you de facto 'have to' go
>>> through a distro instead of get bits from Spark? Different
>>> conversation but I think this sort of effect does not end up being a
>>> negative.
>>>
>>> Well anyway, I like the idea of seeing how far Hadoop-provided
>>> releases can help. It might kill several birds with one stone.
>>>
>>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <ma...@gmail.com> wrote:
>>>> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
>>>>
>>>> Matei
>>>>
>>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>>>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>>>>> Maven artifacts.
>>>>>
>>>>> Patrick I see you just commented on SPARK-5134 and will follow up
>>>>> there. Sounds like this may accidentally not be a problem.
>>>>>
>>>>> On binary tarball releases, I wonder if anyone has an opinion on my
>>>>> opinion that these shouldn't be distributed for specific Hadoop
>>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
>>>>> That resolves this n x m explosion too.
>>>>>
>>>>> Vendors already provide their own distribution, yes, that's their job.
>>>>>
>>>>>
>>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>>>>> Distributions X ...
>>>>>>
>>>>>> May be one option is to have a minimum basic set (which I know is what we
>>>>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>>>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>>>>> build a release of HDP Spark 1.4 bundle.
>>>>>>
>>>>>> Cheers
>>>>>> <k/>
>>>>>>
>>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>>>>>>
>>>>>>> We probably want to revisit the way we do binaries in general for
>>>>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>>>>
>>>>>>> I've been hesitating to add new binaries because people
>>>>>>> (understandably) complain if you ever stop packaging older ones, but
>>>>>>> on the other hand the ASF has complained that we have too many
>>>>>>> binaries already and that we need to pare it down because of the large
>>>>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>>>>> 2.11 seemed like it would be too much.
>>>>>>>
>>>>>>> One solution potentially is to actually package "Hadoop provided"
>>>>>>> binaries and encourage users to use these by simply setting
>>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>>>>> that our existing packages don't work well on HDP for instance, since
>>>>>>> there are some configuration quirks that differ from the upstream
>>>>>>> Hadoop.
>>>>>>>
>>>>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>>>>> more tenable to cross build for Scala versions without exploding the
>>>>>>> number of binaries.
>>>>>>>
>>>>>>> - Patrick
>>>>>>>
>>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>> Yeah, interesting question of what is the better default for the
>>>>>>>> single set of artifacts published to Maven. I think there's an
>>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>>>>> and cons discussed more at
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>>>>> https://github.com/apache/spark/pull/3917
>>>>>>>>
>>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Tested it on Mac OS X.
>>>>>>>>>
>>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>>>>>>
>>>>>>>>> Matei
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>>
>>>>>>
>>>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Sean Owen <so...@cloudera.com>.
Yes, you should always find working bits at Apache no matter what --
though 'no matter what' really means 'as long as you use a Hadoop distro
compatible with upstream Hadoop'. Even distros have a strong interest
in that, since the market, the 'pie', is made large by this kind of
freedom at the core.

If so, then no vendor-specific builds are needed, only some
Hadoop-release-specific ones. So a Hadoop 2.6-specific build could be
good (although I'm not yet clear if there's something about 2.5 or 2.6
that needs a different build.)

I take it that we already believe that, say, the "Hadoop 2.4" build
works with CDH5, so no CDH5-specific build is provided by Spark.

If a distro doesn't work with stock Spark, then it's either something
Spark should fix (e.g. use of a private YARN API or something), or
it's something the distro should really fix because it's incompatible.

Could we maybe rename the "CDH4" build then, as it doesn't really work
with all CDH4, to be a "Hadoop 2.0.x build"? That's been floated
before. And can we remove the MapR builds -- or else can someone
explain why these exist separately from a Hadoop 2.3 build? I hope it
is not *because* they are somehow non-standard. And shall we first run
down why Spark doesn't fully work on HDP and see if it's something
that Spark or HDP needs to tweak, rather than contemplate another
binary? Or, if so, can it simply be called a "Hadoop 2.7 + YARN
whatever" build and not made specific to a vendor, even if the project
has to field another tarball combo for a vendor?

Maybe we are saying almost the same thing.


On Mon, Mar 9, 2015 at 1:33 AM, Matei Zaharia <ma...@gmail.com> wrote:
> Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4.
>
> Matei
>
>> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Yeah it's not much overhead, but here's an example of where it causes
>> a little issue.
>>
>> I like that reasoning. However, the released builds don't track the
>> later versions of Hadoop that vendors would be distributing -- there's
>> no Hadoop 2.6 build for example. CDH4 is here, but not the
>> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
>> actually work with many CDH4 versions.
>>
>> I agree with the goal of maximizing the reach of Spark, but I don't
>> know how much these builds advance that goal.
>>
>> Anyone can roll-their-own exactly-right build, and the docs and build
>> have been set up to make that as simple as can be expected. So these
>> aren't *required* to let me use latest Spark on distribution X.
>>
>> I had thought these existed to sorta support 'legacy' distributions,
>> like CDH4, and that build was justified as a
>> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
>> the MapR profiles are for.
>>
>> I think it's too much work to correctly, in parallel, maintain any
>> customizations necessary for any major distro, and it might be best to
>> do not at all than to do it incompletely. You could say it's also an
>> enabler for distros to vary in ways that require special
>> customization.
>>
>> Maybe there's a concern that, if lots of people consume Spark on
>> Hadoop, and most people consume Hadoop through distros, and distros
>> alone manage Spark distributions, then you de facto 'have to' go
>> through a distro instead of get bits from Spark? Different
>> conversation but I think this sort of effect does not end up being a
>> negative.
>>
>> Well anyway, I like the idea of seeing how far Hadoop-provided
>> releases can help. It might kill several birds with one stone.
>>
>> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <ma...@gmail.com> wrote:
>>> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
>>>
>>> Matei
>>>
>>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>>>> Maven artifacts.
>>>>
>>>> Patrick I see you just commented on SPARK-5134 and will follow up
>>>> there. Sounds like this may accidentally not be a problem.
>>>>
>>>> On binary tarball releases, I wonder if anyone has an opinion on my
>>>> opinion that these shouldn't be distributed for specific Hadoop
>>>> *distributions* to begin with. (Won't repeat the argument here yet.)
>>>> That resolves this n x m explosion too.
>>>>
>>>> Vendors already provide their own distribution, yes, that's their job.
>>>>
>>>>
>>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>>>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>>>> Distributions X ...
>>>>>
>>>>> May be one option is to have a minimum basic set (which I know is what we
>>>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>>>> build a release of HDP Spark 1.4 bundle.
>>>>>
>>>>> Cheers
>>>>> <k/>
>>>>>
>>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>>>>>
>>>>>> We probably want to revisit the way we do binaries in general for
>>>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>>>
>>>>>> I've been hesitating to add new binaries because people
>>>>>> (understandably) complain if you ever stop packaging older ones, but
>>>>>> on the other hand the ASF has complained that we have too many
>>>>>> binaries already and that we need to pare it down because of the large
>>>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>>>> 2.11 seemed like it would be too much.
>>>>>>
>>>>>> One solution potentially is to actually package "Hadoop provided"
>>>>>> binaries and encourage users to use these by simply setting
>>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>>>> that our existing packages don't work well on HDP for instance, since
>>>>>> there are some configuration quirks that differ from the upstream
>>>>>> Hadoop.
>>>>>>
>>>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>>>> more tenable to cross build for Scala versions without exploding the
>>>>>> number of binaries.
>>>>>>
>>>>>> - Patrick
>>>>>>
>>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>> Yeah, interesting question of what is the better default for the
>>>>>>> single set of artifacts published to Maven. I think there's an
>>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>>>> and cons discussed more at
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>>>> https://github.com/apache/spark/pull/3917
>>>>>>>
>>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>>>>> wrote:
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Tested it on Mac OS X.
>>>>>>>>
>>>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>>>>>
>>>>>>>> Matei
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>>>
>>>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Matei Zaharia <ma...@gmail.com>.
Yeah, my concern is that people should get Apache Spark from *Apache*, not from a vendor. It helps everyone use the latest features no matter where they are. In the Hadoop distro case, Hadoop made all this effort to have standard APIs (e.g. YARN), so it should be easy. But it is a problem if we're not packaging for the newest versions of some distros; I think we just fell behind at Hadoop 2.4.

Matei

> On Mar 8, 2015, at 8:02 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> Yeah it's not much overhead, but here's an example of where it causes
> a little issue.
> 
> I like that reasoning. However, the released builds don't track the
> later versions of Hadoop that vendors would be distributing -- there's
> no Hadoop 2.6 build for example. CDH4 is here, but not the
> far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
> actually work with many CDH4 versions.
> 
> I agree with the goal of maximizing the reach of Spark, but I don't
> know how much these builds advance that goal.
> 
> Anyone can roll-their-own exactly-right build, and the docs and build
> have been set up to make that as simple as can be expected. So these
> aren't *required* to let me use latest Spark on distribution X.
> 
> I had thought these existed to sorta support 'legacy' distributions,
> like CDH4, and that build was justified as a
> quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
> the MapR profiles are for.
> 
> I think it's too much work to correctly, in parallel, maintain any
> customizations necessary for any major distro, and it might be best to
> do not at all than to do it incompletely. You could say it's also an
> enabler for distros to vary in ways that require special
> customization.
> 
> Maybe there's a concern that, if lots of people consume Spark on
> Hadoop, and most people consume Hadoop through distros, and distros
> alone manage Spark distributions, then you de facto 'have to' go
> through a distro instead of get bits from Spark? Different
> conversation but I think this sort of effect does not end up being a
> negative.
> 
> Well anyway, I like the idea of seeing how far Hadoop-provided
> releases can help. It might kill several birds with one stone.
> 
> On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <ma...@gmail.com> wrote:
>> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
>> 
>> Matei
>> 
>>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>> 
>>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>>> Maven artifacts.
>>> 
>>> Patrick I see you just commented on SPARK-5134 and will follow up
>>> there. Sounds like this may accidentally not be a problem.
>>> 
>>> On binary tarball releases, I wonder if anyone has an opinion on my
>>> opinion that these shouldn't be distributed for specific Hadoop
>>> *distributions* to begin with. (Won't repeat the argument here yet.)
>>> That resolves this n x m explosion too.
>>> 
>>> Vendors already provide their own distribution, yes, that's their job.
>>> 
>>> 
>>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>>> Distributions X ...
>>>> 
>>>> May be one option is to have a minimum basic set (which I know is what we
>>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>>> build a release of HDP Spark 1.4 bundle.
>>>> 
>>>> Cheers
>>>> <k/>
>>>> 
>>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>>>> 
>>>>> We probably want to revisit the way we do binaries in general for
>>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>> 
>>>>> I've been hesitating to add new binaries because people
>>>>> (understandably) complain if you ever stop packaging older ones, but
>>>>> on the other hand the ASF has complained that we have too many
>>>>> binaries already and that we need to pare it down because of the large
>>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>>> 2.11 seemed like it would be too much.
>>>>> 
>>>>> One solution potentially is to actually package "Hadoop provided"
>>>>> binaries and encourage users to use these by simply setting
>>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>>> that our existing packages don't work well on HDP for instance, since
>>>>> there are some configuration quirks that differ from the upstream
>>>>> Hadoop.
>>>>> 
>>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>>> more tenable to cross build for Scala versions without exploding the
>>>>> number of binaries.
>>>>> 
>>>>> - Patrick
>>>>> 
>>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>>> Yeah, interesting question of what is the better default for the
>>>>>> single set of artifacts published to Maven. I think there's an
>>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>>> and cons discussed more at
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>>> https://github.com/apache/spark/pull/3917
>>>>>> 
>>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>>>> wrote:
>>>>>>> +1
>>>>>>> 
>>>>>>> Tested it on Mac OS X.
>>>>>>> 
>>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>>>> 
>>>>>>> Matei
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>> 
>>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Sean Owen <so...@cloudera.com>.
Yeah it's not much overhead, but here's an example of where it causes
a little issue.

I like that reasoning. However, the released builds don't track the
later versions of Hadoop that vendors would be distributing -- there's
no Hadoop 2.6 build for example. CDH4 is here, but not the
far-more-used CDH5. HDP isn't present at all. The CDH4 build doesn't
actually work with many CDH4 versions.

I agree with the goal of maximizing the reach of Spark, but I don't
know how much these builds advance that goal.

Anyone can roll their own exactly-right build, and the docs and build
have been set up to make that as simple as can be expected. So these
aren't *required* to let me use the latest Spark on distribution X.

I had thought these existed to sorta support 'legacy' distributions,
like CDH4, and that build was justified as a
quasi-Hadoop-2.0.x-flavored build. But then I don't understand what
the MapR profiles are for.

I think it's too much work to correctly, in parallel, maintain any
customizations necessary for any major distro, and it might be better
not to do it at all than to do it incompletely. You could say it's also
an enabler for distros to vary in ways that require special
customization.

Maybe there's a concern that, if lots of people consume Spark on
Hadoop, and most people consume Hadoop through distros, and distros
alone manage Spark distributions, then you de facto 'have to' go
through a distro instead of getting bits from Spark? That's a different
conversation, but I think this sort of effect does not end up being a
negative.

Well anyway, I like the idea of seeing how far Hadoop-provided
releases can help. It might kill several birds with one stone.
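
For reference, the "roll your own" build mentioned above looks roughly
like the following, using the make-distribution.sh script and Maven
profiles as documented around the 1.3 timeframe (script and profile names
are taken from those docs and may differ in other releases). It also
covers the Hadoop 2 + Hive + Scala 2.11 combination Matei asked about.

    # Sketch: build a binary distribution for a specific Hadoop version and
    # Scala 2.11, per the 1.3-era build documentation.
    ./dev/change-version-to-2.11.sh               # switch the POMs to Scala 2.11
    ./make-distribution.sh --name hadoop2.4-scala2.11 --tgz \
      -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Dscala-2.11

A vendor-specific Hadoop version string can be substituted for
-Dhadoop.version in the same way, assuming the corresponding artifacts
are resolvable from a configured repository.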

On Sun, Mar 8, 2015 at 11:07 PM, Matei Zaharia <ma...@gmail.com> wrote:
> Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.
>
> Matei
>
>> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
>> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
>> Maven artifacts.
>>
>> Patrick I see you just commented on SPARK-5134 and will follow up
>> there. Sounds like this may accidentally not be a problem.
>>
>> On binary tarball releases, I wonder if anyone has an opinion on my
>> opinion that these shouldn't be distributed for specific Hadoop
>> *distributions* to begin with. (Won't repeat the argument here yet.)
>> That resolves this n x m explosion too.
>>
>> Vendors already provide their own distribution, yes, that's their job.
>>
>>
>> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>>> Distributions X ...
>>>
>>> May be one option is to have a minimum basic set (which I know is what we
>>> are discussing) and move the rest to spark-packages.org. There the vendors
>>> can add the latest downloads - for example when 1.4 is released, HDP can
>>> build a release of HDP Spark 1.4 bundle.
>>>
>>> Cheers
>>> <k/>
>>>
>>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>>>
>>>> We probably want to revisit the way we do binaries in general for
>>>> 1.4+. IMO, something worth forking a separate thread for.
>>>>
>>>> I've been hesitating to add new binaries because people
>>>> (understandably) complain if you ever stop packaging older ones, but
>>>> on the other hand the ASF has complained that we have too many
>>>> binaries already and that we need to pare it down because of the large
>>>> volume of files. Doubling the number of binaries we produce for Scala
>>>> 2.11 seemed like it would be too much.
>>>>
>>>> One solution potentially is to actually package "Hadoop provided"
>>>> binaries and encourage users to use these by simply setting
>>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>>> that our existing packages don't work well on HDP for instance, since
>>>> there are some configuration quirks that differ from the upstream
>>>> Hadoop.
>>>>
>>>> If we cut down on the cross building for Hadoop versions, then it is
>>>> more tenable to cross build for Scala versions without exploding the
>>>> number of binaries.
>>>>
>>>> - Patrick
>>>>
>>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>>> Yeah, interesting question of what is the better default for the
>>>>> single set of artifacts published to Maven. I think there's an
>>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>>> and cons discussed more at
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>>> https://github.com/apache/spark/pull/3917
>>>>>
>>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>>> wrote:
>>>>>> +1
>>>>>>
>>>>>> Tested it on Mac OS X.
>>>>>>
>>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>>>
>>>>>> Matei
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>>> For additional commands, e-mail: dev-help@spark.apache.org
>>>>
>>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Release Scala version vs Hadoop version (was: [VOTE] Release Apache Spark 1.3.0 (RC3))

Posted by Matei Zaharia <ma...@gmail.com>.
Our goal is to let people use the latest Apache release even if vendors fall behind or don't want to package everything, so that's why we put out releases for vendors' versions. It's fairly low overhead.

Matei

> On Mar 8, 2015, at 5:56 PM, Sean Owen <so...@cloudera.com> wrote:
> 
> Ah. I misunderstood that Matei was referring to the Scala 2.11 tarball
> at http://people.apache.org/~pwendell/spark-1.3.0-rc3/ and not the
> Maven artifacts.
> 
> Patrick I see you just commented on SPARK-5134 and will follow up
> there. Sounds like this may accidentally not be a problem.
> 
> On binary tarball releases, I wonder if anyone has an opinion on my
> opinion that these shouldn't be distributed for specific Hadoop
> *distributions* to begin with. (Won't repeat the argument here yet.)
> That resolves this n x m explosion too.
> 
> Vendors already provide their own distribution, yes, that's their job.
> 
> 
> On Sun, Mar 8, 2015 at 9:42 PM, Krishna Sankar <ks...@gmail.com> wrote:
>> Yep, otherwise this will become an N^2 problem - Scala versions X Hadoop
>> Distributions X ...
>> 
>> May be one option is to have a minimum basic set (which I know is what we
>> are discussing) and move the rest to spark-packages.org. There the vendors
>> can add the latest downloads - for example when 1.4 is released, HDP can
>> build a release of HDP Spark 1.4 bundle.
>> 
>> Cheers
>> <k/>
>> 
>> On Sun, Mar 8, 2015 at 2:11 PM, Patrick Wendell <pw...@gmail.com> wrote:
>>> 
>>> We probably want to revisit the way we do binaries in general for
>>> 1.4+. IMO, something worth forking a separate thread for.
>>> 
>>> I've been hesitating to add new binaries because people
>>> (understandably) complain if you ever stop packaging older ones, but
>>> on the other hand the ASF has complained that we have too many
>>> binaries already and that we need to pare it down because of the large
>>> volume of files. Doubling the number of binaries we produce for Scala
>>> 2.11 seemed like it would be too much.
>>> 
>>> One solution potentially is to actually package "Hadoop provided"
>>> binaries and encourage users to use these by simply setting
>>> HADOOP_HOME, or have instructions for specific distros. I've heard
>>> that our existing packages don't work well on HDP for instance, since
>>> there are some configuration quirks that differ from the upstream
>>> Hadoop.
>>> 
>>> If we cut down on the cross building for Hadoop versions, then it is
>>> more tenable to cross build for Scala versions without exploding the
>>> number of binaries.
>>> 
>>> - Patrick
>>> 
>>> On Sun, Mar 8, 2015 at 12:46 PM, Sean Owen <so...@cloudera.com> wrote:
>>>> Yeah, interesting question of what is the better default for the
>>>> single set of artifacts published to Maven. I think there's an
>>>> argument for Hadoop 2 and perhaps Hive for the 2.10 build too. Pros
>>>> and cons discussed more at
>>>> 
>>>> https://issues.apache.org/jira/browse/SPARK-5134
>>>> https://github.com/apache/spark/pull/3917
>>>> 
>>>> On Sun, Mar 8, 2015 at 7:42 PM, Matei Zaharia <ma...@gmail.com>
>>>> wrote:
>>>>> +1
>>>>> 
>>>>> Tested it on Mac OS X.
>>>>> 
>>>>> One small issue I noticed is that the Scala 2.11 build is using Hadoop
>>>>> 1 without Hive, which is kind of weird because people will more likely want
>>>>> Hadoop 2 with Hive. So it would be good to publish a build for that
>>>>> configuration instead. We can do it if we do a new RC, or it might be that
>>>>> binary builds may not need to be voted on (I forgot the details there).
>>>>> 
>>>>> Matei
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org
>>> 
>> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org