Posted to dev@spark.apache.org by Ted Yu <yu...@gmail.com> on 2014/07/27 21:01:57 UTC

Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Thanks for replying, Patrick.

The intention of my first email was to utilize newer hadoop releases for
their bug fixes. I am still looking for a clean way of passing the hadoop release
version number to individual classes.
Using newer hadoop releases would encourage pushing bug fixes / new
features upstream. Ultimately Spark code would become cleaner.

Cheers
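For context, the lock discussed below guards construction of Hadoop Configuration objects, whose constructor is not thread-safe in older Hadoop releases. A minimal illustrative sketch of the pattern follows; java.util.Properties stands in for org.apache.hadoop.conf.Configuration so the example runs without Hadoop on the classpath:

```java
// Illustrative sketch of the pattern behind
// HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK: serialize construction of a
// class whose constructor is not thread-safe. java.util.Properties is a
// stand-in for org.apache.hadoop.conf.Configuration.
import java.util.Properties;

public class GuardedConfig {
    // Shared lock: only one thread constructs a "Configuration" at a time.
    private static final Object CONFIGURATION_INSTANTIATION_LOCK = new Object();

    public static Properties newConfiguration() {
        synchronized (CONFIGURATION_INSTANTIATION_LOCK) {
            // Per-task cost is one brief lock acquisition, as Patrick notes.
            return new Properties();
        }
    }

    public static void main(String[] args) {
        System.out.println(newConfiguration() != null);
    }
}
```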

On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:

> Ted - technically I think you are correct, although I wouldn't
> recommend disabling this lock. This lock is not expensive (acquired
> once per task, as are many other locks already). Also, we've seen some
> cases where Hadoop concurrency bugs ended up requiring multiple fixes
> - concurrency of client access is not well tested in the Hadoop
> codebase since most of the Hadoop tools do not use concurrent access.
> So in general it's good to be conservative in what we expect of the
> Hadoop client libraries.
>
> If you'd like to discuss this further, please fork a new thread, since
> this is a vote thread. Thanks!
>
> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
> > HADOOP-10456 is fixed in hadoop 2.4.1
> >
> > Does this mean that synchronization
> > on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
> > 2.4.1 ?
> >
> > Cheers
> >
> >
> > On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >
> >> The most important issue in this release is actually an amendment to
> >> an earlier fix. The original fix caused a deadlock which was a
> >> regression from 1.0.0->1.0.1:
> >>
> >> Issue:
> >> https://issues.apache.org/jira/browse/SPARK-1097
> >>
> >> 1.0.1 Fix:
> >> https://github.com/apache/spark/pull/1273/files (had a deadlock)
> >>
> >> 1.0.2 Fix:
> >> https://github.com/apache/spark/pull/1409/files
> >>
> >> I failed to correctly label this on JIRA, but I've updated it!
> >>
> >> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
> >> <mi...@databricks.com> wrote:
> >> > That query is looking at "Fix Version" not "Target Version".  The fact
> >> that
> >> > the first one is still open is only because the bug is not resolved in
> >> > master.  It is fixed in 1.0.2.  The second one is partially fixed in
> >> 1.0.2,
> >> > but is not worth blocking the release for.
> >> >
> >> >
> >> > On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> >> > nicholas.chammas@gmail.com> wrote:
> >> >
> >> >> TD, there are a couple of unresolved issues slated for 1.0.2
> >> >> <
> >> >>
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> >> >> >.
> >> >> Should they be edited somehow?
> >> >>
> >> >>
> >> >> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> >> >> tathagata.das1565@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Please vote on releasing the following candidate as Apache Spark
> >> version
> >> >> > 1.0.2.
> >> >> >
> >> >> > This release fixes a number of bugs in Spark 1.0.1.
> >> >> > Some of the notable ones are
> >> >> > - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
> for
> >> >> > SPARK-1199. The fix was reverted for 1.0.2.
> >> >> > - SPARK-2576: NoClassDefFoundError when executing Spark SQL query on
> >> >> > HDFS CSV file.
> >> >> > The full list is at http://s.apache.org/9NJ
> >> >> >
> >> >> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> >> >> >
> >> >> >
> >> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> >> >> >
> >> >> > The release files, including signatures, digests, etc can be found
> at:
> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
> >> >> >
> >> >> > Release artifacts are signed with the following key:
> >> >> > https://people.apache.org/keys/committer/tdas.asc
> >> >> >
> >> >> > The staging repository for this release can be found at:
> >> >> >
> >> https://repository.apache.org/content/repositories/orgapachespark-1024/
> >> >> >
> >> >> > The documentation corresponding to this release can be found at:
> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> >> >> >
> >> >> > Please vote on releasing this package as Apache Spark 1.0.2!
> >> >> >
> >> >> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> >> >> > a majority of at least 3 +1 PMC votes are cast.
> >> >> > [ ] +1 Release this package as Apache Spark 1.0.2
> >> >> > [ ] -1 Do not release this package because ...
> >> >> >
> >> >> > To learn more about Apache Spark, please see
> >> >> > http://spark.apache.org/
> >> >> >
> >> >>
> >>
>

Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Posted by Sean Owen <so...@cloudera.com>.
Right, the scenario is, for example, that a class is added in release
2.5.0 but has been back-ported to a 2.4.1-based distribution. That
distribution isn't missing anything from 2.4.1, but a reported version of
"2.4.1" doesn't reliably tell you whether or not the class is there.

By the way, I just found there is already such a class,
org.apache.hadoop.util.VersionInfo:

https://github.com/apache/hadoop-common/blob/release-2.4.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/VersionInfo.java

It appears to have been around for a long time. Theoretical problems
aside, there may be cases where querying the version is a fine and
reliable solution.
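A sketch of calling that class via reflection, so the caller compiles and runs even without hadoop-common on the classpath. getVersion() is VersionInfo's real static accessor; the "unknown" fallback value is our own convention:

```java
// Sketch: query Hadoop's reported version via reflection, avoiding a hard
// compile-time dependency on hadoop-common.
public class HadoopVersionProbe {
    public static String hadoopVersionOrUnknown() {
        try {
            Class<?> c = Class.forName("org.apache.hadoop.util.VersionInfo");
            return (String) c.getMethod("getVersion").invoke(null);
        } catch (ReflectiveOperationException e) {
            // Hadoop absent, or the class/method missing in this build.
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("Hadoop version: " + hadoopVersionOrUnknown());
    }
}
```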

On Jul 28, 2014 12:54 AM, "Matei Zaharia" <ma...@gmail.com> wrote:
>
> We could also do this, though it would be great if the Hadoop project provided this version number as at least a baseline. It's up to distributors to decide which version they report but I imagine they won't remove stuff that's in the reported version number.
>
> Matei
>
> On Jul 27, 2014, at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:
>
> > Good idea, although it gets difficult in the context of multiple
> > distributions. Say change X is not present in version A, but present
> > in version B. If you depend on X, what version can you look for to
> > detect it? The distribution will return "A" or "A+X" or somesuch, but
> > testing for "A" will give an incorrect answer, and the code can't be
> > expected to look for everyone's "A+X" versions. Actually inspecting
> > the code is more robust if a bit messier.
> >
> > On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia <ma...@gmail.com> wrote:
> >> For this particular issue, it would be good to know if Hadoop provides an API to determine the Hadoop version. If not, maybe that can be added to Hadoop in its next release, and we can check for it with reflection. We recently added a SparkContext.version() method in Spark to let you tell the version.
> >>
> >> Matei
> >>
> >> On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pw...@gmail.com> wrote:
> >>
> >>> Hey Ted,
> >>>
> >>> We always intend Spark to work with the newer Hadoop versions and
> >>> encourage Spark users to use the newest Hadoop versions for best
> >>> performance.
> >>>
> >>> We do try to be liberal in terms of supporting older versions as well.
> >>> This is because many people run older HDFS versions and we want Spark
> >>> to read and write data from them. So far we've been willing to do this
> >>> despite some maintenance cost.
> >>>
> >>> The reason is that for many users it's very expensive to do a
> >>> wholesale upgrade of HDFS, but trying out new versions of Spark is
> >>> much easier. For instance, some of the largest scale Spark users run
> >>> fairly old or forked HDFS versions.
> >>>
> >>> - Patrick
> >>>
> >>> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>> Thanks for replying, Patrick.
> >>>>
> >>>> The intention of my first email was to utilize newer hadoop releases for
> >>>> their bug fixes. I am still looking for a clean way of passing the hadoop release
> >>>> version number to individual classes.
> >>>> Using newer hadoop releases would encourage pushing bug fixes / new
> >>>> features upstream. Ultimately Spark code would become cleaner.
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:
> >>>>
> >>>>> Ted - technically I think you are correct, although I wouldn't
> >>>>> recommend disabling this lock. This lock is not expensive (acquired
> >>>>> once per task, as are many other locks already). Also, we've seen some
> >>>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
> >>>>> - concurrency of client access is not well tested in the Hadoop
> >>>>> codebase since most of the Hadoop tools do not use concurrent access.
> >>>>> So in general it's good to be conservative in what we expect of the
> >>>>> Hadoop client libraries.
> >>>>>
> >>>>> If you'd like to discuss this further, please fork a new thread, since
> >>>>> this is a vote thread. Thanks!
> >>>>>
> >>>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
> >>>>>> HADOOP-10456 is fixed in hadoop 2.4.1
> >>>>>>
> >>>>>> Does this mean that synchronization
> >>>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
> >>>>>> 2.4.1 ?
> >>>>>>
> >>>>>> Cheers
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>>> The most important issue in this release is actually an amendment to
> >>>>>>> an earlier fix. The original fix caused a deadlock which was a
> >>>>>>> regression from 1.0.0->1.0.1:
> >>>>>>>
> >>>>>>> Issue:
> >>>>>>> https://issues.apache.org/jira/browse/SPARK-1097
> >>>>>>>
> >>>>>>> 1.0.1 Fix:
> >>>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
> >>>>>>>
> >>>>>>> 1.0.2 Fix:
> >>>>>>> https://github.com/apache/spark/pull/1409/files
> >>>>>>>
> >>>>>>> I failed to correctly label this on JIRA, but I've updated it!
> >>>>>>>
> >>>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
> >>>>>>> <mi...@databricks.com> wrote:
> >>>>>>>> That query is looking at "Fix Version" not "Target Version".  The fact
> >>>>>>> that
> >>>>>>>> the first one is still open is only because the bug is not resolved in
> >>>>>>>> master.  It is fixed in 1.0.2.  The second one is partially fixed in
> >>>>>>> 1.0.2,
> >>>>>>>> but is not worth blocking the release for.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
> >>>>>>>> nicholas.chammas@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
> >>>>>>>>> <
> >>>>>>>>>
> >>>>>>>
> >>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
> >>>>>>>>>> .
> >>>>>>>>> Should they be edited somehow?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
> >>>>>>>>> tathagata.das1565@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
> >>>>>>> version
> >>>>>>>>>> 1.0.2.
> >>>>>>>>>>
> >>>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
> >>>>>>>>>> Some of the notable ones are
> >>>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
> >>>>> for
> >>>>>>>>>> SPARK-1199. The fix was reverted for 1.0.2.
> >>>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing Spark SQL query on
> >>>>>>>>>> HDFS CSV file.
> >>>>>>>>>> The full list is at http://s.apache.org/9NJ
> >>>>>>>>>>
> >>>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
> >>>>>>>>>>
> >>>>>>>>>> The release files, including signatures, digests, etc can be found
> >>>>> at:
> >>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
> >>>>>>>>>>
> >>>>>>>>>> Release artifacts are signed with the following key:
> >>>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
> >>>>>>>>>>
> >>>>>>>>>> The staging repository for this release can be found at:
> >>>>>>>>>>
> >>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
> >>>>>>>>>>
> >>>>>>>>>> The documentation corresponding to this release can be found at:
> >>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
> >>>>>>>>>>
> >>>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
> >>>>>>>>>>
> >>>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
> >>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
> >>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
> >>>>>>>>>> [ ] -1 Do not release this package because ...
> >>>>>>>>>>
> >>>>>>>>>> To learn more about Apache Spark, please see
> >>>>>>>>>> http://spark.apache.org/
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>
> >>
>

Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Posted by Matei Zaharia <ma...@gmail.com>.
We could also do this, though it would be great if the Hadoop project provided this version number as at least a baseline. It's up to distributors to decide which version they report but I imagine they won't remove stuff that's in the reported version number.

Matei
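One hedged sketch of using the reported version as a baseline, as suggested above: extract the leading upstream component from a vendor version string and compare it against a minimum. The parsing rule and the sample strings are assumptions, since distributors are free to report anything:

```java
// Hypothetical sketch: treat the leading dotted-numeric prefix of a vendor
// version string (e.g. "2.4.1-cdh5.1.0") as the upstream baseline.
public class BaselineVersion {
    // Returns {major, minor, patch} parsed from the leading numeric prefix.
    public static int[] baseline(String reported) {
        String head = reported.split("[^0-9.]", 2)[0]; // stop at first non-numeric char
        String[] parts = head.split("\\.");
        int[] nums = new int[3];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            if (!parts[i].isEmpty()) nums[i] = Integer.parseInt(parts[i]);
        }
        return nums;
    }

    // True if the reported baseline is at least major.minor.patch.
    public static boolean atLeast(String reported, int maj, int min, int patch) {
        int[] v = baseline(reported);
        if (v[0] != maj) return v[0] > maj;
        if (v[1] != min) return v[1] > min;
        return v[2] >= patch;
    }

    public static void main(String[] args) {
        System.out.println(atLeast("2.4.1-cdh5.1.0", 2, 4, 1)); // true
    }
}
```

As the thread notes, this only tells you the baseline "A", not whether a particular backport "X" is present; for that, probing the code directly is more reliable.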

On Jul 27, 2014, at 1:57 PM, Sean Owen <so...@cloudera.com> wrote:

> Good idea, although it gets difficult in the context of multiple
> distributions. Say change X is not present in version A, but present
> in version B. If you depend on X, what version can you look for to
> detect it? The distribution will return "A" or "A+X" or somesuch, but
> testing for "A" will give an incorrect answer, and the code can't be
> expected to look for everyone's "A+X" versions. Actually inspecting
> the code is more robust if a bit messier.
> 
> On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia <ma...@gmail.com> wrote:
>> For this particular issue, it would be good to know if Hadoop provides an API to determine the Hadoop version. If not, maybe that can be added to Hadoop in its next release, and we can check for it with reflection. We recently added a SparkContext.version() method in Spark to let you tell the version.
>> 
>> Matei
>> 
>> On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pw...@gmail.com> wrote:
>> 
>>> Hey Ted,
>>> 
>>> We always intend Spark to work with the newer Hadoop versions and
>>> encourage Spark users to use the newest Hadoop versions for best
>>> performance.
>>> 
>>> We do try to be liberal in terms of supporting older versions as well.
>>> This is because many people run older HDFS versions and we want Spark
>>> to read and write data from them. So far we've been willing to do this
>>> despite some maintenance cost.
>>> 
>>> The reason is that for many users it's very expensive to do a
>>> wholesale upgrade of HDFS, but trying out new versions of Spark is
>>> much easier. For instance, some of the largest scale Spark users run
>>> fairly old or forked HDFS versions.
>>> 
>>> - Patrick
>>> 
>>> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> Thanks for replying, Patrick.
>>>> 
>>>> The intention of my first email was to utilize newer hadoop releases for
>>>> their bug fixes. I am still looking for a clean way of passing the hadoop release
>>>> version number to individual classes.
>>>> Using newer hadoop releases would encourage pushing bug fixes / new
>>>> features upstream. Ultimately Spark code would become cleaner.
>>>> 
>>>> Cheers
>>>> 
>>>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:
>>>> 
>>>>> Ted - technically I think you are correct, although I wouldn't
>>>>> recommend disabling this lock. This lock is not expensive (acquired
>>>>> once per task, as are many other locks already). Also, we've seen some
>>>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>>>>> - concurrency of client access is not well tested in the Hadoop
>>>>> codebase since most of the Hadoop tools do not use concurrent access.
>>>>> So in general it's good to be conservative in what we expect of the
>>>>> Hadoop client libraries.
>>>>> 
>>>>> If you'd like to discuss this further, please fork a new thread, since
>>>>> this is a vote thread. Thanks!
>>>>> 
>>>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>>> HADOOP-10456 is fixed in hadoop 2.4.1
>>>>>> 
>>>>>> Does this mean that synchronization
>>>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
>>>>>> 2.4.1 ?
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
>>>>> wrote:
>>>>>> 
>>>>>>> The most important issue in this release is actually an amendment to
>>>>>>> an earlier fix. The original fix caused a deadlock which was a
>>>>>>> regression from 1.0.0->1.0.1:
>>>>>>> 
>>>>>>> Issue:
>>>>>>> https://issues.apache.org/jira/browse/SPARK-1097
>>>>>>> 
>>>>>>> 1.0.1 Fix:
>>>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>>>>>>> 
>>>>>>> 1.0.2 Fix:
>>>>>>> https://github.com/apache/spark/pull/1409/files
>>>>>>> 
>>>>>>> I failed to correctly label this on JIRA, but I've updated it!
>>>>>>> 
>>>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>>>>>>> <mi...@databricks.com> wrote:
>>>>>>>> That query is looking at "Fix Version" not "Target Version".  The fact
>>>>>>> that
>>>>>>>> the first one is still open is only because the bug is not resolved in
>>>>>>>> master.  It is fixed in 1.0.2.  The second one is partially fixed in
>>>>>>> 1.0.2,
>>>>>>>> but is not worth blocking the release for.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
>>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
>>>>>>>>> <
>>>>>>>>> 
>>>>>>> 
>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
>>>>>>>>>> .
>>>>>>>>> Should they be edited somehow?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
>>>>>>>>> tathagata.das1565@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version
>>>>>>>>>> 1.0.2.
>>>>>>>>>> 
>>>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
>>>>>>>>>> Some of the notable ones are
>>>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
>>>>> for
>>>>>>>>>> SPARK-1199. The fix was reverted for 1.0.2.
>>>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing Spark SQL query on
>>>>>>>>>> HDFS CSV file.
>>>>>>>>>> The full list is at http://s.apache.org/9NJ
>>>>>>>>>> 
>>>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>>>>>>>>>> 
>>>>>>>>>> The release files, including signatures, digests, etc can be found
>>>>> at:
>>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
>>>>>>>>>> 
>>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
>>>>>>>>>> 
>>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>>> 
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
>>>>>>>>>> 
>>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>>>>>>>>>> 
>>>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
>>>>>>>>>> 
>>>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
>>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>> 
>>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>>> http://spark.apache.org/
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>> 
>> 


Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Posted by Sean Owen <so...@cloudera.com>.
Good idea, although it gets difficult in the context of multiple
distributions. Say change X is not present in version A, but present
in version B. If you depend on X, what version can you look for to
detect it? The distribution will return "A" or "A+X" or somesuch, but
testing for "A" will give an incorrect answer, and the code can't be
expected to look for everyone's "A+X" versions. Actually inspecting
the code is more robust if a bit messier.
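A minimal sketch of that "inspect the code" approach: probe directly for the class or method whose backport you depend on, instead of comparing version strings. The class and method names passed in are illustrative; the probe itself runs on any JVM:

```java
// Feature detection by probing the classpath, rather than trusting a
// distribution's reported version string.
public class FeatureProbe {
    public static boolean hasClass(String className) {
        try {
            // initialize=false: just check presence, don't run static initializers
            Class.forName(className, false, FeatureProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static boolean hasMethod(String className, String methodName) {
        try {
            Class<?> c = Class.forName(className);
            c.getMethod(methodName); // throws if the no-arg method is absent
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Works whether or not Hadoop is on the classpath:
        System.out.println(hasClass("org.apache.hadoop.util.VersionInfo"));
    }
}
```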

On Sun, Jul 27, 2014 at 9:50 PM, Matei Zaharia <ma...@gmail.com> wrote:
> For this particular issue, it would be good to know if Hadoop provides an API to determine the Hadoop version. If not, maybe that can be added to Hadoop in its next release, and we can check for it with reflection. We recently added a SparkContext.version() method in Spark to let you tell the version.
>
> Matei
>
> On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey Ted,
>>
>> We always intend Spark to work with the newer Hadoop versions and
>> encourage Spark users to use the newest Hadoop versions for best
>> performance.
>>
>> We do try to be liberal in terms of supporting older versions as well.
>> This is because many people run older HDFS versions and we want Spark
>> to read and write data from them. So far we've been willing to do this
>> despite some maintenance cost.
>>
>> The reason is that for many users it's very expensive to do a
>> wholesale upgrade of HDFS, but trying out new versions of Spark is
>> much easier. For instance, some of the largest scale Spark users run
>> fairly old or forked HDFS versions.
>>
>> - Patrick
>>
>> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yu...@gmail.com> wrote:
>>> Thanks for replying, Patrick.
>>>
>>> The intention of my first email was to utilize newer hadoop releases for
>>> their bug fixes. I am still looking for a clean way of passing the hadoop release
>>> version number to individual classes.
>>> Using newer hadoop releases would encourage pushing bug fixes / new
>>> features upstream. Ultimately Spark code would become cleaner.
>>>
>>> Cheers
>>>
>>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:
>>>
>>>> Ted - technically I think you are correct, although I wouldn't
>>>> recommend disabling this lock. This lock is not expensive (acquired
>>>> once per task, as are many other locks already). Also, we've seen some
>>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>>>> - concurrency of client access is not well tested in the Hadoop
>>>> codebase since most of the Hadoop tools do not use concurrent access.
>>>> So in general it's good to be conservative in what we expect of the
>>>> Hadoop client libraries.
>>>>
>>>> If you'd like to discuss this further, please fork a new thread, since
>>>> this is a vote thread. Thanks!
>>>>
>>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
>>>>> HADOOP-10456 is fixed in hadoop 2.4.1
>>>>>
>>>>> Does this mean that synchronization
>>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
>>>>> 2.4.1 ?
>>>>>
>>>>> Cheers
>>>>>
>>>>>
>>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
>>>> wrote:
>>>>>
>>>>>> The most important issue in this release is actually an amendment to
>>>>>> an earlier fix. The original fix caused a deadlock which was a
>>>>>> regression from 1.0.0->1.0.1:
>>>>>>
>>>>>> Issue:
>>>>>> https://issues.apache.org/jira/browse/SPARK-1097
>>>>>>
>>>>>> 1.0.1 Fix:
>>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>>>>>>
>>>>>> 1.0.2 Fix:
>>>>>> https://github.com/apache/spark/pull/1409/files
>>>>>>
>>>>>> I failed to correctly label this on JIRA, but I've updated it!
>>>>>>
>>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>>>>>> <mi...@databricks.com> wrote:
>>>>>>> That query is looking at "Fix Version" not "Target Version".  The fact
>>>>>> that
>>>>>>> the first one is still open is only because the bug is not resolved in
>>>>>>> master.  It is fixed in 1.0.2.  The second one is partially fixed in
>>>>>> 1.0.2,
>>>>>>> but is not worth blocking the release for.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
>>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>>>
>>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
>>>>>>>> <
>>>>>>>>
>>>>>>
>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
>>>>>>>>> .
>>>>>>>> Should they be edited somehow?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
>>>>>>>> tathagata.das1565@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version
>>>>>>>>> 1.0.2.
>>>>>>>>>
>>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
>>>>>>>>> Some of the notable ones are
>>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
>>>> for
>>>>>>>>> SPARK-1199. The fix was reverted for 1.0.2.
>>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing Spark SQL query on
>>>>>>>>> HDFS CSV file.
>>>>>>>>> The full list is at http://s.apache.org/9NJ
>>>>>>>>>
>>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>>>>>>>>>
>>>>>>>>> The release files, including signatures, digests, etc can be found
>>>> at:
>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
>>>>>>>>>
>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
>>>>>>>>>
>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>>
>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
>>>>>>>>>
>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>>>>>>>>>
>>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
>>>>>>>>>
>>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>> http://spark.apache.org/
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>

Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Posted by Matei Zaharia <ma...@gmail.com>.
For this particular issue, it would be good to know if Hadoop provides an API to determine the Hadoop version. If not, maybe that can be added to Hadoop in its next release, and we can check for it with reflection. We recently added a SparkContext.version() method in Spark to let you tell the version.

Matei

On Jul 27, 2014, at 12:19 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Ted,
> 
> We always intend Spark to work with the newer Hadoop versions and
> encourage Spark users to use the newest Hadoop versions for best
> performance.
> 
> We do try to be liberal in terms of supporting older versions as well.
> This is because many people run older HDFS versions and we want Spark
> to read and write data from them. So far we've been willing to do this
> despite some maintenance cost.
> 
> The reason is that for many users it's very expensive to do a
> wholesale upgrade of HDFS, but trying out new versions of Spark is
> much easier. For instance, some of the largest scale Spark users run
> fairly old or forked HDFS versions.
> 
> - Patrick
> 
> On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yu...@gmail.com> wrote:
>> Thanks for replying, Patrick.
>> 
>> The intention of my first email was to utilize newer hadoop releases for
>> their bug fixes. I am still looking for a clean way of passing the hadoop release
>> version number to individual classes.
>> Using newer hadoop releases would encourage pushing bug fixes / new
>> features upstream. Ultimately Spark code would become cleaner.
>> 
>> Cheers
>> 
>> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:
>> 
>>> Ted - technically I think you are correct, although I wouldn't
>>> recommend disabling this lock. This lock is not expensive (acquired
>>> once per task, as are many other locks already). Also, we've seen some
>>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>>> - concurrency of client access is not well tested in the Hadoop
>>> codebase since most of the Hadoop tools do not use concurrent access.
>>> So in general it's good to be conservative in what we expect of the
>>> Hadoop client libraries.
>>> 
>>> If you'd like to discuss this further, please fork a new thread, since
>>> this is a vote thread. Thanks!
>>> 
>>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> HADOOP-10456 is fixed in hadoop 2.4.1
>>>> 
>>>> Does this mean that synchronization
>>>> on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
>>>> 2.4.1 ?
>>>> 
>>>> Cheers
>>>> 
>>>> 
>>>> On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
>>> wrote:
>>>> 
>>>>> The most important issue in this release is actually an amendment to
>>>>> an earlier fix. The original fix caused a deadlock which was a
>>>>> regression from 1.0.0->1.0.1:
>>>>> 
>>>>> Issue:
>>>>> https://issues.apache.org/jira/browse/SPARK-1097
>>>>> 
>>>>> 1.0.1 Fix:
>>>>> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>>>>> 
>>>>> 1.0.2 Fix:
>>>>> https://github.com/apache/spark/pull/1409/files
>>>>> 
>>>>> I failed to correctly label this on JIRA, but I've updated it!
>>>>> 
>>>>> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>>>>> <mi...@databricks.com> wrote:
>>>>>> That query is looking at "Fix Version" not "Target Version".  The fact
>>>>> that
>>>>>> the first one is still open is only because the bug is not resolved in
>>>>>> master.  It is fixed in 1.0.2.  The second one is partially fixed in
>>>>> 1.0.2,
>>>>>> but is not worth blocking the release for.
>>>>>> 
>>>>>> 
>>>>>> On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
>>>>>> nicholas.chammas@gmail.com> wrote:
>>>>>> 
>>>>>>> TD, there are a couple of unresolved issues slated for 1.0.2
>>>>>>> <
>>>>>>> 
>>>>> 
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
>>>>>>>> .
>>>>>>> Should they be edited somehow?
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
>>>>>>> tathagata.das1565@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version
>>>>>>>> 1.0.2.
>>>>>>>> 
>>>>>>>> This release fixes a number of bugs in Spark 1.0.1.
>>>>>>>> Some of the notable ones are
>>>>>>>> - SPARK-2452: Known issue in Spark 1.0.1 caused by attempted fix
>>> for
>>>>>>>> SPARK-1199. The fix was reverted for 1.0.2.
>>>>>>>> - SPARK-2576: NoClassDefFoundError when executing Spark SQL query on
>>>>>>>> HDFS CSV file.
>>>>>>>> The full list is at http://s.apache.org/9NJ
>>>>>>>> 
>>>>>>>> The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>>>>>>>> 
>>>>>>>> The release files, including signatures, digests, etc can be found
>>> at:
>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1/
>>>>>>>> 
>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>> https://people.apache.org/keys/committer/tdas.asc
>>>>>>>> 
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>> 
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1024/
>>>>>>>> 
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>>>>>>>> 
>>>>>>>> Please vote on releasing this package as Apache Spark 1.0.2!
>>>>>>>> 
>>>>>>>> The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>>>>>>>> a majority of at least 3 +1 PMC votes are cast.
>>>>>>>> [ ] +1 Release this package as Apache Spark 1.0.2
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>> 
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> http://spark.apache.org/
>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 


Re: Utilize newer hadoop releases WAS: [VOTE] Release Apache Spark 1.0.2 (RC1)

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Ted,

We always intend Spark to work with newer Hadoop versions and
encourage Spark users to use the newest Hadoop versions for best
performance.

We do try to be liberal in terms of supporting older versions as well.
This is because many people run older HDFS versions and we want Spark
to read data from and write data to them. So far we've been willing to do this
despite some maintenance cost.

The reason is that for many users it's very expensive to do a
wholesale upgrade of HDFS, but trying out new versions of Spark is
much easier. For instance, some of the largest scale Spark users run
fairly old or forked HDFS versions.

- Patrick
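
[Editorial note: the guarded-instantiation pattern debated further down in this thread can be sketched as below. This is a minimal illustration, not Spark's actual code: `ConfGuard` and the `Supplier`-based factory are hypothetical stand-ins for `HadoopRDD` and the `org.apache.hadoop.conf.Configuration` constructor.]

```java
import java.util.function.Supplier;

// Sketch of funneling all Configuration construction through one global
// lock, because the constructor mutates shared static state in affected
// Hadoop versions (see HADOOP-10456 below).
public class ConfGuard {
    // Global lock, analogous to HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.
    private static final Object CONFIGURATION_INSTANTIATION_LOCK = new Object();

    // Acquired once per task, so the cost is negligible next to the
    // task's own work, as Patrick notes below.
    public static <T> T newConfiguration(Supplier<T> factory) {
        synchronized (CONFIGURATION_INSTANTIATION_LOCK) {
            return factory.get();
        }
    }
}
```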

On Sun, Jul 27, 2014 at 12:01 PM, Ted Yu <yu...@gmail.com> wrote:
> Thanks for replying, Patrick.
>
> The intention of my first email was for utilizing newer hadoop releases for
> their bug fixes. I am still looking for a clean way of passing the hadoop release
> version number to individual classes.
> Using newer hadoop releases would encourage pushing bug fixes / new
> features upstream. Ultimately Spark code would become cleaner.
>
> Cheers
>
> On Sun, Jul 27, 2014 at 8:52 AM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Ted - technically I think you are correct, although I wouldn't
>> recommend disabling this lock. This lock is not expensive (acquired
>> once per task, as are many other locks already). Also, we've seen some
>> cases where Hadoop concurrency bugs ended up requiring multiple fixes
>> - concurrency of client access is not well tested in the Hadoop
>> codebase since most of the Hadoop tools do not use concurrent access.
>> So in general it's good to be conservative in what we expect of the
>> Hadoop client libraries.
>>
>> If you'd like to discuss this further, please fork a new thread, since
>> this is a vote thread. Thanks!
>>
>> On Fri, Jul 25, 2014 at 10:14 PM, Ted Yu <yu...@gmail.com> wrote:
>> > HADOOP-10456 is fixed in hadoop 2.4.1
>> >
>> > Does this mean that synchronization
>> > on HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK can be bypassed for hadoop
>> > 2.4.1 ?
>> >
>> > Cheers
>> >
>> >
>> > On Fri, Jul 25, 2014 at 6:00 PM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>> >
>> >> The most important issue in this release is actually an amendment to
>> >> an earlier fix. The original fix caused a deadlock which was a
>> >> regression from 1.0.0->1.0.1:
>> >>
>> >> Issue:
>> >> https://issues.apache.org/jira/browse/SPARK-1097
>> >>
>> >> 1.0.1 Fix:
>> >> https://github.com/apache/spark/pull/1273/files (had a deadlock)
>> >>
>> >> 1.0.2 Fix:
>> >> https://github.com/apache/spark/pull/1409/files
>> >>
>> >> I failed to correctly label this on JIRA, but I've updated it!
>> >>
>> >> On Fri, Jul 25, 2014 at 5:35 PM, Michael Armbrust
>> >> <mi...@databricks.com> wrote:
>> >> > That query is looking at "Fix Version" not "Target Version".  The fact
>> >> that
>> >> > the first one is still open is only because the bug is not resolved in
>> >> > master.  It is fixed in 1.0.2.  The second one is partially fixed in
>> >> 1.0.2,
>> >> > but is not worth blocking the release for.
>> >> >
>> >> >
>> >> > On Fri, Jul 25, 2014 at 4:23 PM, Nicholas Chammas <
>> >> > nicholas.chammas@gmail.com> wrote:
>> >> >
>> >> >> TD, there are a couple of unresolved issues slated for 1.0.2
>> >> >> <
>> >> >>
>> >>
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%201.0.2%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20priority%20DESC
>> >> >> >.
>> >> >> Should they be edited somehow?
>> >> >>
>> >> >>
>> >> >> On Fri, Jul 25, 2014 at 7:08 PM, Tathagata Das <
>> >> >> tathagata.das1565@gmail.com>
>> >> >> wrote:
>> >> >>
>> >> >> > Please vote on releasing the following candidate as Apache Spark
>> >> version
>> >> >> > 1.0.2.
>> >> >> >
>> >> >> > This release fixes a number of bugs in Spark 1.0.1.
>> >> >> > Some of the notable ones are
>> >> >> > - SPARK-2452: Known issue in Spark 1.0.1 caused by the attempted fix
>> for
>> >> >> > SPARK-1199. The fix was reverted for 1.0.2.
>> >> >> > - SPARK-2576: NoClassDefFoundError when executing Spark QL query on
>> >> >> > HDFS CSV file.
>> >> >> > The full list is at http://s.apache.org/9NJ
>> >> >> >
>> >> >> > The tag to be voted on is v1.0.2-rc1 (commit 8fb6f00e):
>> >> >> >
>> >> >> >
>> >> >>
>> >>
>> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=8fb6f00e195fb258f3f70f04756e07c259a2351f
>> >> >> >
>> >> >> > The release files, including signatures, digests, etc can be found
>> at:
>> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1/
>> >> >> >
>> >> >> > Release artifacts are signed with the following key:
>> >> >> > https://people.apache.org/keys/committer/tdas.asc
>> >> >> >
>> >> >> > The staging repository for this release can be found at:
>> >> >> >
>> >> https://repository.apache.org/content/repositories/orgapachespark-1024/
>> >> >> >
>> >> >> > The documentation corresponding to this release can be found at:
>> >> >> > http://people.apache.org/~tdas/spark-1.0.2-rc1-docs/
>> >> >> >
>> >> >> > Please vote on releasing this package as Apache Spark 1.0.2!
>> >> >> >
>> >> >> > The vote is open until Tuesday, July 29, at 23:00 UTC and passes if
>> >> >> > a majority of at least 3 +1 PMC votes are cast.
>> >> >> > [ ] +1 Release this package as Apache Spark 1.0.2
>> >> >> > [ ] -1 Do not release this package because ...
>> >> >> >
>> >> >> > To learn more about Apache Spark, please see
>> >> >> > http://spark.apache.org/
>> >> >> >
>> >> >>
>> >>
>>
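
[Editorial note: the kind of version gate Ted asks about above (bypassing the lock once HADOOP-10456 is fixed, i.e. on Hadoop 2.4.1+) could be sketched as follows. The class and method names are hypothetical; in practice the running version would come from something like org.apache.hadoop.util.VersionInfo.getVersion().]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only sketch: compare the running Hadoop version string
// against the release that fixed HADOOP-10456, so callers could decide
// whether the global instantiation lock is still needed.
public class HadoopVersionCheck {
    // Parse "2.4.1" into numeric components, stopping at the first
    // non-numeric piece (e.g. "2.0.0-cdh4.2.0" -> [2, 0, 0]).
    static int[] parts(String v) {
        List<Integer> out = new ArrayList<>();
        for (String p : v.split("[.-]")) {
            if (!p.matches("\\d+")) break;
            out.add(Integer.parseInt(p));
        }
        int[] arr = new int[out.size()];
        for (int i = 0; i < arr.length; i++) arr[i] = out.get(i);
        return arr;
    }

    // True when `version` is at least `minimum`, component by component;
    // missing components are treated as 0 ("2.4" < "2.4.1").
    public static boolean atLeast(String version, String minimum) {
        int[] a = parts(version), b = parts(minimum);
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = i < a.length ? a[i] : 0;
            int y = i < b.length ? b[i] : 0;
            if (x != y) return x > y;
        }
        return true;
    }
}
```

Per Patrick's caution in this thread, even with such a check it may be safer to keep the lock, since client-side concurrency is not well tested across Hadoop releases.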