Posted to dev@spark.apache.org by Michael Heuer <he...@gmail.com> on 2017/05/01 15:13:21 UTC

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
not bump the dependency version for avro (currently at 1.7.7).  Though
perhaps not clear from the issue I reported [0], this means that Spark is
internally inconsistent, in that a call through parquet (which depends on
avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.

[0] - https://issues.apache.org/jira/browse/SPARK-19697
[1] -
https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96
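One quick way to see which Avro actually wins at runtime is to ask the JVM where it loaded the class from; a minimal sketch (the class and method names below are illustrative, not from Spark or Parquet):

```java
// Sketch: report which jar a class was loaded from, e.g. to check whether
// org.apache.avro.Schema resolves to avro-1.7.7.jar or avro-1.8.x.jar.
public class WhichJar {
    public static String locationOf(Class<?> c) {
        java.security.CodeSource src = c.getProtectionDomain().getCodeSource();
        // Bootstrap-loaded classes (e.g. java.lang.String) have no code source.
        return src == null ? "bootstrap classloader" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        // In a Spark job you would pass org.apache.avro.Schema.class here.
        System.out.println(locationOf(String.class));
    }
}
```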

On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:

> I have one more issue that, if it needs to be fixed, needs to be fixed for
> 2.2.0.
>
> I'm fixing build warnings for the release and noticed that checkstyle
> actually complains there are some Java methods named in TitleCase, like
> `ProcessingTimeTimeout`:
>
> https://github.com/apache/spark/pull/17803/files#r113934080
>
> Easy enough to fix and it's right, that's not conventional. However I
> wonder if it was done on purpose to match a class name?
>
> I think this is one for @tdas
>
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mi...@databricks.com>
> wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and
>> passes if a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>>
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v2.2.0-rc1
>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>
>> List of JIRA tickets resolved can be found with this filter
>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>> .
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>
>>
>> *FAQ*
>>
>> *How can I help test this release?*
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should be
>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>
>> *But my bug isn't fixed!??!*
>>
>> In order to make timely releases, we will typically not hold the release
>> unless the bug in question is a regression from 2.1.1.
>>
>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Michael Armbrust <mi...@databricks.com>.
I'm going to -1 this given the number of small bug fixes that have gone
into the release branch.  I'll follow with another RC shortly.

On Tue, May 2, 2017 at 7:35 AM, Nick Pentreath <ni...@gmail.com>
wrote:

> I won't +1 just given that it seems certain there will be another RC and
> there are the outstanding ML QA blocker issues.
>
> But clean build and test for JVM and Python tests LGTM on CentOS Linux
> 7.2.1511, OpenJDK 1.8.0_111
>
>
> On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft <fn...@berkeley.edu>
> wrote:
>
>> Hi Ryan,
>>
>> IMO, the problem is that the Spark Avro version conflicts with the
>> Parquet Avro version. As discussed upthread, I don’t think there’s a way to
>> *reliably* make sure that Avro 1.8 is on the classpath first while using
>> spark-submit. Relocating avro in our project wouldn’t solve the problem,
>> because the NoSuchMethodError is thrown from the internals of the
>> ParquetAvroOutputFormat, not from code in our project.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466
>>
>> On May 1, 2017, at 12:33 PM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Michael, I think that the problem is with your classpath.
>>
>> Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
>> what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
>> dependency on Avro 1.8. It is understandably annoying that using the same
>> version of Parquet for your parquet-avro dependency is what causes your
>> project to depend on Avro 1.8, but Spark's dependencies aren't a problem
>> because its Parquet dependency doesn't bring in Avro.
>>
>> There are a few ways around this:
>> 1. Make sure Avro 1.8 is found in the classpath first
>> 2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
>> 3. Use parquet-avro 1.8.1 in your project, which I think should work with
>> 1.8.2 and avoid the Avro change
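Option 2 above (shading) is usually done with a relocation; a sketch using the Maven Shade plugin (the shaded package prefix is illustrative), which only helps when Avro classes are not shared with Spark-side code:

```xml
<!-- Sketch: relocate Avro classes inside the application's überjar so they
     cannot collide with the Avro 1.7.7 that ships with Spark. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.1.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>com.example.shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```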
>>
>> The work-around in Spark is for tests, which do use parquet-avro. We can
>> look at a Parquet 1.8.3 that avoids this issue, but I think this is
>> reasonable for the 2.2.0 release.
>>
>> rb
>>
>> On Mon, May 1, 2017 at 12:08 PM, Michael Heuer <he...@gmail.com> wrote:
>>
>>> Please excuse me if I'm misunderstanding -- the problem is not with our
>>> library or our classpath.
>>>
>>> There is a conflict within Spark itself, in that Parquet 1.8.2 expects
>>> to find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
>>> already has to work around this for unit tests to pass.
>>>
>>>
>>>
>>> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>>> Thanks for the extra context, Frank. I agree that it sounds like your
>>>> problem comes from the conflict between your Jars and what comes with
>>>> Spark. It's the same concern that makes everyone shudder when anything has a
>>>> public dependency on Jackson. :)
>>>>
>>>> What we usually do to get around situations like this is to relocate
>>>> the problem library inside the shaded Jar. That way, Spark uses its version
>>>> of Avro and your classes use a different version of Avro. This works if you
>>>> don't need to share classes between the two. Would that work for your
>>>> situation?
>>>>
>>>> rb
>>>>
>>>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> sounds like you are running into the fact that you cannot really put
>>>>> your classes before spark's on classpath? spark's switches to support this
>>>>> never really worked for me either.
>>>>>
>>>>> inability to control the classpath + inconsistent jars => trouble ?
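The switches referred to here are presumably Spark's experimental `spark.driver.userClassPathFirst` / `spark.executor.userClassPathFirst` properties; a sketch of how they would be set (application jar and class names are illustrative):

```shell
# Sketch: ask Spark to prefer the application's classes over its own.
# These properties are documented as experimental and, as noted above,
# do not always behave as hoped.
spark-submit \
  --class com.example.MyApp \
  --conf spark.driver.userClassPathFirst=true \
  --conf spark.executor.userClassPathFirst=true \
  my-uberjar.jar
```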
>>>>>
>>>>> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
>>>>> fnothaft@berkeley.edu> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> We do set Avro to 1.8 in our downstream project. We also set Spark as
>>>>>> a provided dependency, and build an überjar. We run via spark-submit, which
>>>>>> builds the classpath with our überjar and all of the Spark deps. This leads
>>>>>> to avro 1.7.7 getting picked off of the classpath at runtime, which causes
>>>>>> the no such method exception to occur.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Frank Austin Nothaft
>>>>>> fnothaft@berkeley.edu
>>>>>> fnothaft@eecs.berkeley.edu
>>>>>> 202-340-0466
>>>>>>
>>>>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>>>
>>>>>> Frank,
>>>>>>
>>>>>> The issue you're running into is caused by using parquet-avro with
>>>>>> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
>>>>>> Spark can't update Avro because it is a breaking change that would force
>>>>>> users to rebuild specific Avro classes in some cases. But you should be
>>>>>> free to use Avro 1.8 to avoid the problem.
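Pinning the dependency in the downstream project's pom would look something like this (a sketch; the exact 1.8.x patch version is up to the project):

```xml
<!-- Sketch: pin Avro explicitly so parquet-avro 1.8.2 finds the 1.8 APIs -->
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.8.0</version>
</dependency>
```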
>>>>>>
>>>>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>>>>> fnothaft@berkeley.edu> wrote:
>>>>>>
>>>>>>> Hi Ryan et al,
>>>>>>>
>>>>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>>>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>>>>> documented this in Spark-19697
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that
>>>>>>> Spark has unit tests that check this compatibility issue, but it looks like
>>>>>>> there was a recent change that sets a test scope dependency on Avro
>>>>>>> 1.8.0
>>>>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>>>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Frank Austin Nothaft
>>>>>>> fnothaft@berkeley.edu
>>>>>>> fnothaft@eecs.berkeley.edu
>>>>>>> 202-340-0466
>>>>>>>
>>>>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>>>>>>>
>>>>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>>>>> execution, it implements the record materialization APIs in Parquet to go
>>>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>>>>> dependency into Spark as far as I can tell.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>>>>> bigger one.

>>> 
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/ <http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/>
>>> 
>>> 
>>> FAQ
>>> 
>>> How can I help test this release?
>>> 
>>> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
>>> 
>>> What should happen to JIRA tickets still targeting 2.2.0?
>>> 
>>> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>> 
>>> But my bug isn't fixed!??!
>>> 
>>> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Michael, I think that the problem is with your classpath.

Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
dependency on Avro 1.8. It is understandably annoying that using the same
version of Parquet for your parquet-avro dependency is what causes your
project to depend on Avro 1.8, but Spark's dependencies aren't a problem
because its Parquet dependency doesn't bring in Avro.

There are a few ways around this:
1. Make sure Avro 1.8 is found in the classpath first
2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
3. Use parquet-avro 1.8.1 in your project, which I think should work with
1.8.2 and avoid the Avro change
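
Option 2 (shading) can be sketched with sbt-assembly shade rules in the downstream project's build. This is an illustrative fragment, not taken from any project on this thread; the `shadeavro` rename prefix is an arbitrary choice:

```scala
// build.sbt fragment (sketch): relocate the application's Avro classes inside
// the überjar so they cannot collide with the Avro 1.7.7 jar that ships with
// Spark. Assumes the sbt-assembly plugin is enabled for this build.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.avro.**" -> "shadeavro.@1").inAll
)
```

As noted above, this only works when Avro classes are not shared across the boundary with Spark.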

The work-around in Spark is for tests, which do use parquet-avro. We can
look at a Parquet 1.8.3 that avoids this issue, but I think this is
reasonable for the 2.2.0 release.

rb

On Mon, May 1, 2017 at 12:08 PM, Michael Heuer <he...@gmail.com> wrote:

> Please excuse me if I'm misunderstanding -- the problem is not with our
> library or our classpath.
>
> There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
> find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
> already has to work around this for unit tests to pass.
>
>
>
> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rb...@netflix.com> wrote:
>
>> Thanks for the extra context, Frank. I agree that it sounds like your
>> problem comes from the conflict between your Jars and what comes with
> Spark. It's the same concern that makes everyone shudder when anything has a
>> public dependency on Jackson. :)
>>
>> What we usually do to get around situations like this is to relocate the
>> problem library inside the shaded Jar. That way, Spark uses its version of
>> Avro and your classes use a different version of Avro. This works if you
>> don't need to share classes between the two. Would that work for your
>> situation?
>>
>> rb
>>
>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> sounds like you are running into the fact that you cannot really put
>>> your classes before spark's on classpath? spark's switches to support this
>>> never really worked for me either.
>>>
>>> inability to control the classpath + inconsistent jars => trouble ?
>>>
>>> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
>>> fnothaft@berkeley.edu> wrote:
>>>
>>>> Hi Ryan,
>>>>
>>>> We do set Avro to 1.8 in our downstream project. We also set Spark as a
>>>> provided dependency, and build an überjar. We run via spark-submit, which
>>>> builds the classpath with our überjar and all of the Spark deps. This leads
>>>> to avro 1.7.7 getting picked off of the classpath at runtime, which causes
>>>> the no such method exception to occur.
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnothaft@berkeley.edu
>>>> fnothaft@eecs.berkeley.edu
>>>> 202-340-0466 <(202)%20340-0466>
>>>>
>>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>>
>>>> Frank,
>>>>
>>>> The issue you're running into is caused by using parquet-avro with Avro
>>>> 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
>>>> can't update Avro because it is a breaking change that would force users to
>>>> rebuild specific Avro classes in some cases. But you should be free to use
>>>> Avro 1.8 to avoid the problem.
>>>>
>>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>>> fnothaft@berkeley.edu> wrote:
>>>>
>>>>> Hi Ryan et al,
>>>>>
>>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>>> documented this in SPARK-19697
>>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that
>>>>> Spark has unit tests that check this compatibility issue, but it looks like
>>>>> there was a recent change that sets a test scope dependency on Avro
>>>>> 1.8.0
>>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Frank Austin Nothaft
>>>>> fnothaft@berkeley.edu
>>>>> fnothaft@eecs.berkeley.edu
>>>>> 202-340-0466 <(202)%20340-0466>
>>>>>
>>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID
>>>>> <rb...@netflix.com.invalid>> wrote:
>>>>>
>>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>>> execution, it implements the record materialization APIs in Parquet to go
>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>>> dependency into Spark as far as I can tell.
>>>>>
>>>>> rb
>>>>>
>>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>>> bigger one.
>>>>>>
>>>>>>
>>>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but
>>>>>>> does not bump the dependency version for avro (currently at 1.7.7).  Though
>>>>>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>>>>>> internally inconsistent, in that a call through parquet (which depends on
>>>>>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>>>>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>>>>
>>>>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8
>>>>>>> .2/pom.xml#L96
>>>>>>>
>>>>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have one more issue that, if it needs to be fixed, needs to be
>>>>>>>> fixed for 2.2.0.
>>>>>>>>
>>>>>>>> I'm fixing build warnings for the release and noticed that
>>>>>>>> checkstyle actually complains there are some Java methods named in
>>>>>>>> TitleCase, like `ProcessingTimeTimeout`:
>>>>>>>>
>>>>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>>>>
>>>>>>>> Easy enough to fix and it's right, that's not conventional. However
>>>>>>>> I wonder if it was done on purpose to match a class name?
>>>>>>>>
>>>>>>>> I think this is one for @tdas
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Michael Heuer <he...@gmail.com>.
Please excuse me if I'm misunderstanding -- the problem is not with our
library or our classpath.

There is a conflict within Spark itself, in that Parquet 1.8.2 expects to
find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
already has to work around this for unit tests to pass.
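
When debugging this kind of conflict, it helps to confirm at runtime which jar a class was actually loaded from. The sketch below uses only the JDK; `WhichJar` is a hypothetical helper, and when diagnosing the Avro case you would pass `Class.forName("org.apache.avro.Schema")` instead of the placeholder class used here:

```java
import java.security.CodeSource;

// Minimal classpath diagnostic: report where a class was loaded from.
// Inside a Spark job, this shows whether Avro resolves to the 1.7.7 jar
// shipped with Spark or to the application's 1.8.x jar.
public class WhichJar {
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // JDK bootstrap classes report no code source.
        return src == null ? "(bootstrap classpath)" : src.getLocation().toString();
    }

    public static void main(String[] args) {
        System.out.println(locationOf(WhichJar.class));
    }
}
```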



On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rb...@netflix.com> wrote:

> Thanks for the extra context, Frank. I agree that it sounds like your
> problem comes from the conflict between your Jars and what comes with
> Spark. It's the same concern that makes everyone shudder when anything has a
> public dependency on Jackson. :)
>
> What we usually do to get around situations like this is to relocate the
> problem library inside the shaded Jar. That way, Spark uses its version of
> Avro and your classes use a different version of Avro. This works if you
> don't need to share classes between the two. Would that work for your
> situation?
>
> rb
>
> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> sounds like you are running into the fact that you cannot really put your
>> classes before spark's on classpath? spark's switches to support this never
>> really worked for me either.
>>
>> inability to control the classpath + inconsistent jars => trouble ?
>>
>> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
>> fnothaft@berkeley.edu> wrote:
>>
>>> Hi Ryan,
>>>
>>> We do set Avro to 1.8 in our downstream project. We also set Spark as a
>>> provided dependency, and build an überjar. We run via spark-submit, which
>>> builds the classpath with our überjar and all of the Spark deps. This leads
>>> to avro 1.7.7 getting picked off of the classpath at runtime, which causes
>>> the no such method exception to occur.
>>>
>>> Regards,
>>>
>>> Frank Austin Nothaft
>>> fnothaft@berkeley.edu
>>> fnothaft@eecs.berkeley.edu
>>> 202-340-0466 <(202)%20340-0466>
>>>
>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>>>
>>> Frank,
>>>
>>> The issue you're running into is caused by using parquet-avro with Avro
>>> 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
>>> can't update Avro because it is a breaking change that would force users to
>>> rebuild specific Avro classes in some cases. But you should be free to use
>>> Avro 1.8 to avoid the problem.
>>>
>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>> fnothaft@berkeley.edu> wrote:
>>>
>>>> Hi Ryan et al,
>>>>
>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>> documented this in SPARK-19697
>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark
>>>> has unit tests that check this compatibility issue, but it looks like there
>>>> was a recent change that sets a test scope dependency on Avro 1.8.0
>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>>>>
>>>> Regards,
>>>>
>>>> Frank Austin Nothaft
>>>> fnothaft@berkeley.edu
>>>> fnothaft@eecs.berkeley.edu
>>>> 202-340-0466 <(202)%20340-0466>
>>>>
>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID
>>>> <rb...@netflix.com.invalid>> wrote:
>>>>
>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>> execution, it implements the record materialization APIs in Parquet to go
>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>> dependency into Spark as far as I can tell.
>>>>
>>>> rb
>>>>
>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>> bigger one.
>>>>>
>>>>>
>>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but
>>>>>> does not bump the dependency version for avro (currently at 1.7.7).  Though
>>>>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>>>>> internally inconsistent, in that a call through parquet (which depends on
>>>>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>>>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>>>
>>>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8
>>>>>> .2/pom.xml#L96
>>>>>>
>>>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I have one more issue that, if it needs to be fixed, needs to be
>>>>>>> fixed for 2.2.0.
>>>>>>>
>>>>>>> I'm fixing build warnings for the release and noticed that
>>>>>>> checkstyle actually complains there are some Java methods named in
>>>>>>> TitleCase, like `ProcessingTimeTimeout`:
>>>>>>>
>>>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>>>
>>>>>>> Easy enough to fix and it's right, that's not conventional. However
>>>>>>> I wonder if it was done on purpose to match a class name?
>>>>>>>
>>>>>>> I think this is one for @tdas
>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>>
>>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hi Ryan!

I think relocating the Avro dependency inside of Spark would make a lot of sense. Otherwise, we’d need Spark to move to Avro 1.8.0, or Parquet to cut a new 1.8.3 release that either reverts to Avro 1.7.7 or eliminates the code that is binary incompatible between Avro 1.7.7 and 1.8.0.

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

> On May 1, 2017, at 12:00 PM, Ryan Blue <rb...@netflix.com> wrote:
> 
> Thanks for the extra context, Frank. I agree that it sounds like your problem comes from the conflict between your Jars and what comes with Spark. It's the same concern that makes everyone shudder when anything has a public dependency on Jackson. :)
> 
> What we usually do to get around situations like this is to relocate the problem library inside the shaded Jar. That way, Spark uses its version of Avro and your classes use a different version of Avro. This works if you don't need to share classes between the two. Would that work for your situation?
> 
> rb
> 
> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <koert@tresata.com <ma...@tresata.com>> wrote:
> sounds like you are running into the fact that you cannot really put your classes before spark's on classpath? spark's switches to support this never really worked for me either.
> 
> inability to control the classpath + inconsistent jars => trouble ?
> 
> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <fnothaft@berkeley.edu <ma...@berkeley.edu>> wrote:
> Hi Ryan,
> 
> We do set Avro to 1.8 in our downstream project. We also set Spark as a provided dependency, and build an überjar. We run via spark-submit, which builds the classpath with our überjar and all of the Spark deps. This leads to avro 1.7.7 getting picked off of the classpath at runtime, which causes the no such method exception to occur.
> 
> Regards,
> 
> Frank Austin Nothaft
> fnothaft@berkeley.edu <ma...@berkeley.edu>
> fnothaft@eecs.berkeley.edu <ma...@eecs.berkeley.edu>
> 202-340-0466 <tel:(202)%20340-0466>
>> On May 1, 2017, at 11:31 AM, Ryan Blue <rblue@netflix.com <ma...@netflix.com>> wrote:
>> 
>> Frank,
>> 
>> The issue you're running into is caused by using parquet-avro with Avro 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark can't update Avro because it is a breaking change that would force users to rebuild specific Avro classes in some cases. But you should be free to use Avro 1.8 to avoid the problem.
>> 
>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <fnothaft@berkeley.edu <ma...@berkeley.edu>> wrote:
>> Hi Ryan et al,
>> 
>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a downstream project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My colleague Michael (who posted earlier on this thread) documented this in SPARK-19697 <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark has unit tests that check this compatibility issue, but it looks like there was a recent change that sets a test scope dependency on Avro 1.8.0 <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>, which masks this issue in the unit tests. With this error, you can’t use the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>> 
>> Regards,
>> 
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu <ma...@berkeley.edu>
>> fnothaft@eecs.berkeley.edu <ma...@eecs.berkeley.edu>
>> 202-340-0466 <tel:(202)%20340-0466>
>> 
>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID <ma...@netflix.com.invalid>> wrote:
>>> 
>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For execution, it implements the record materialization APIs in Parquet to go directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 dependency into Spark as far as I can tell.
>>> 
>>> rb
>>> 
>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
>>> See discussion at https://github.com/apache/spark/pull/17163 <https://github.com/apache/spark/pull/17163> -- I think the issue is that fixing this trades one problem for a slightly bigger one.
>>> 
>>> 
>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com <ma...@gmail.com>> wrote:
>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does not bump the dependency version for avro (currently at 1.7.7).  Though perhaps not clear from the issue I reported [0], this means that Spark is internally inconsistent, in that a call through parquet (which depends on avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>> 
>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697 <https://issues.apache.org/jira/browse/SPARK-19697>
>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96 <https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96>
>>> 
>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
>>> I have one more issue that, if it needs to be fixed, needs to be fixed for 2.2.0.
>>> 
>>> I'm fixing build warnings for the release and noticed that checkstyle actually complains there are some Java methods named in TitleCase, like `ProcessingTimeTimeout`:
>>> 
>>> https://github.com/apache/spark/pull/17803/files#r113934080 <https://github.com/apache/spark/pull/17803/files#r113934080>
>>> 
>>> Easy enough to fix and it's right, that's not conventional. However I wonder if it was done on purpose to match a class name?
>>> 
>>> I think this is one for @tdas
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Thanks for the extra context, Frank. I agree that it sounds like your
problem comes from the conflict between your Jars and what comes with
Spark. It's the same concern that makes everyone shudder when anything has a
public dependency on Jackson. :)

What we usually do to get around situations like this is to relocate the
problem library inside the shaded Jar. That way, Spark uses its version of
Avro and your classes use a different version of Avro. This works if you
don't need to share classes between the two. Would that work for your
situation?

rb

On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <ko...@tresata.com> wrote:

> sounds like you are running into the fact that you cannot really put your
> classes before spark's on classpath? spark's switches to support this never
> really worked for me either.
>
> inability to control the classpath + inconsistent jars => trouble ?
>
> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
> fnothaft@berkeley.edu> wrote:
>
>> Hi Ryan,
>>
>> We do set Avro to 1.8 in our downstream project. We also set Spark as a
>> provided dependency, and build an überjar. We run via spark-submit, which
>> builds the classpath with our überjar and all of the Spark deps. This leads
>> to avro 1.7.7 getting picked off of the classpath at runtime, which causes
>> the no such method exception to occur.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466 <(202)%20340-0466>
>>
>> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>>
>> Frank,
>>
>> The issue you're running into is caused by using parquet-avro with Avro
>> 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
>> can't update Avro because it is a breaking change that would force users to
>> rebuild specific Avro classes in some cases. But you should be free to use
>> Avro 1.8 to avoid the problem.
>>
>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>> fnothaft@berkeley.edu> wrote:
>>
>>> Hi Ryan et al,
>>>
>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>> dependency. My colleague Michael (who posted earlier on this thread)
>>> documented this in SPARK-19697
>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark
>>> has unit tests that check this compatibility issue, but it looks like there
>>> was a recent change that sets a test scope dependency on Avro 1.8.0
>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>> which masks this issue in the unit tests. With this error, you can’t use
>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
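
One quick way to triage this kind of NoSuchMethodError is to ask the runtime whether the Avro actually loaded declares the method parquet-avro calls (a sketch; the names used in `main` are stand-ins, since the thread does not name the exact missing method):

```java
// Reports whether the class currently on the classpath declares a public
// method with the given name -- useful for spotting version skew before it
// surfaces as a NoSuchMethodError deep inside a job.
public class MethodProbe {
    static boolean hasMethod(String className, String methodName) {
        try {
            for (java.lang.reflect.Method m : Class.forName(className).getMethods()) {
                if (m.getName().equals(methodName)) return true;
            }
            return false;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // In a Spark app: hasMethod("org.apache.avro.Schema", <suspect method>)
        System.out.println(hasMethod("java.lang.String", "isEmpty"));
    }
}
```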
>>>
>>> Regards,
>>>
>>> Frank Austin Nothaft
>>> fnothaft@berkeley.edu
>>> fnothaft@eecs.berkeley.edu
>>> 202-340-0466
>>>
>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>>>
>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>> execution, it implements the record materialization APIs in Parquet to go
>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>> dependency into Spark as far as I can tell.
>>>
>>> rb
>>>
>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>> think the issue is that fixing this trades one problem for a slightly
>>>> bigger one.
>>>>
>>>>
>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com> wrote:
>>>>
>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but
>>>>> does not bump the dependency version for avro (currently at 1.7.7).  Though
>>>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>>>> internally inconsistent, in that a call through parquet (which depends on
>>>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>>
>>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96
>>>>>
>>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>
>>>>>> I have one more issue that, if it needs to be fixed, needs to be
>>>>>> fixed for 2.2.0.
>>>>>>
>>>>>> I'm fixing build warnings for the release and noticed that checkstyle
>>>>>> actually complains there are some Java methods named in TitleCase, like
>>>>>> `ProcessingTimeTimeout`:
>>>>>>
>>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>>
>>>>>> Easy enough to fix and it's right, that's not conventional. However I
>>>>>> wonder if it was done on purpose to match a class name?
>>>>>>
>>>>>> I think this is one for @tdas
>>>>>>
>>>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
>>>>>> michael@databricks.com> wrote:
>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
>>>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v2.2.0-rc1
>>>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>>>>>
>>>>>>> List of JIRA tickets resolved can be found with this filter
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>>>>> .
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>>>>>
>>>>>>>
>>>>>>> *FAQ*
>>>>>>>
>>>>>>> *How can I help test this release?*
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>>> reporting any regressions.
>>>>>>>
>>>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>>>
>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Koert Kuipers <ko...@tresata.com>.
sounds like you are running into the fact that you cannot really put your
classes before spark's on classpath? spark's switches to support this never
really worked for me either.

inability to control the classpath + inconsistent jars => trouble ?

On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <fn...@berkeley.edu>
wrote:

> Hi Ryan,
>
> We do set Avro to 1.8 in our downstream project. We also set Spark as a
> provided dependency, and build an überjar. We run via spark-submit, which
> builds the classpath with our überjar and all of the Spark deps. This leads
> to avro 1.7.1 getting picked off of the classpath at runtime, which causes
> the no such method exception to occur.
>
> Regards,
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
>
> Frank,
>
> The issue you're running into is caused by using parquet-avro with Avro
> 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
> can't update Avro because it is a breaking change that would force users to
> rebuilt specific Avro classes in some cases. But you should be free to use
> Avro 1.8 to avoid the problem.
>
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
> fnothaft@berkeley.edu> wrote:
>
>> Hi Ryan et al,
>>
>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>> dependency. My colleague Michael (who posted earlier on this thread)
>> documented this in Spark-19697
>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark
>> has unit tests that check this compatibility issue, but it looks like there
>> was a recent change that sets a test scope dependency on Avro 1.8.0
>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>> which masks this issue in the unit tests. With this error, you can’t use
>> the ParquetAvroOutputFormat from a application running on Spark 2.2.0.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466 <(202)%20340-0466>
>>
>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID
>> <rb...@netflix.com.invalid>> wrote:
>>
>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>> execution, it implements the record materialization APIs in Parquet to go
>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>> dependency into Spark as far as I can tell.
>>
>> rb
>>
>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> See discussion at https://github.com/apache/spark/pull/17163 -- I think
>>> the issue is that fixing this trades one problem for a slightly bigger one.
>>>
>>>
>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com> wrote:
>>>
>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but
>>>> does not bump the dependency version for avro (currently at 1.7.7).  Though
>>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>>> internally inconsistent, in that a call through parquet (which depends on
>>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>
>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8
>>>> .2/pom.xml#L96
>>>>
>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>
>>>>> I have one more issue that, if it needs to be fixed, needs to be fixed
>>>>> for 2.2.0.
>>>>>
>>>>> I'm fixing build warnings for the release and noticed that checkstyle
>>>>> actually complains there are some Java methods named in TitleCase, like
>>>>> `ProcessingTimeTimeout`:
>>>>>
>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>
>>>>> Easy enough to fix and it's right, that's not conventional. However I
>>>>> wonder if it was done on purpose to match a class name?
>>>>>
>>>>> I think this is one for @tdas
>>>>>
>>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
>>>>> michael@databricks.com> wrote:
>>>>>
>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
>>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>
>>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>>> [ ] -1 Do not release this package because ...
>>>>>>
>>>>>>
>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>
>>>>>> The tag to be voted on is v2.2.0-rc1
>>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c
>>>>>> 1a8f8966c7e64010cf5632cb6)
>>>>>>
>>>>>> List of JIRA tickets resolved can be found with this filter
>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>>>> .
>>>>>>
>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>> at:
>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>>
>>>>>> Release artifacts are signed with the following key:
>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>
>>>>>> The staging repository for this release can be found at:
>>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>>> spark-1235/
>>>>>>
>>>>>> The documentation corresponding to this release can be found at:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.
>>>>>> 0-rc1-docs/
>>>>>>
>>>>>>
>>>>>> *FAQ*
>>>>>>
>>>>>> *How can I help test this release?*
>>>>>>
>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>> an existing Spark workload and running on this release candidate, then
>>>>>> reporting any regressions.
>>>>>>
>>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>>
>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>>
>>>>>> *But my bug isn't fixed!??!*
>>>>>>
>>>>>> In order to make timely releases, we will typically not hold the
>>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>>
>>>>>
>>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hi Ryan,

We do set Avro to 1.8 in our downstream project. We also set Spark as a provided dependency, and build an überjar. We run via spark-submit, which builds the classpath with our überjar and all of the Spark deps. This leads to avro 1.7.1 getting picked off of the classpath at runtime, which causes the no such method exception to occur.

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

> On May 1, 2017, at 11:31 AM, Ryan Blue <rb...@netflix.com> wrote:
> 
> Frank,
> 
> The issue you're running into is caused by using parquet-avro with Avro 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark can't update Avro because it is a breaking change that would force users to rebuilt specific Avro classes in some cases. But you should be free to use Avro 1.8 to avoid the problem.
> 
> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <fnothaft@berkeley.edu <ma...@berkeley.edu>> wrote:
> Hi Ryan et al,
> 
> The issue we’ve seen using a build of the Spark 2.2.0 branch from a downstream project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My colleague Michael (who posted earlier on this thread) documented this in Spark-19697 <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark has unit tests that check this compatibility issue, but it looks like there was a recent change that sets a test scope dependency on Avro 1.8.0 <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>, which masks this issue in the unit tests. With this error, you can’t use the ParquetAvroOutputFormat from a application running on Spark 2.2.0.
> 
> Regards,
> 
> Frank Austin Nothaft
> fnothaft@berkeley.edu <ma...@berkeley.edu>
> fnothaft@eecs.berkeley.edu <ma...@eecs.berkeley.edu>
> 202-340-0466 <tel:(202)%20340-0466>
> 
>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID <ma...@netflix.com.invalid>> wrote:
>> 
>> I agree with Sean. Spark only pulls in parquet-avro for tests. For execution, it implements the record materialization APIs in Parquet to go directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 dependency into Spark as far as I can tell.
>> 
>> rb
>> 
>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
>> See discussion at https://github.com/apache/spark/pull/17163 <https://github.com/apache/spark/pull/17163> -- I think the issue is that fixing this trades one problem for a slightly bigger one.
>> 
>> 
>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com <ma...@gmail.com>> wrote:
>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does not bump the dependency version for avro (currently at 1.7.7).  Though perhaps not clear from the issue I reported [0], this means that Spark is internally inconsistent, in that a call through parquet (which depends on avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>> 
>> [0] - https://issues.apache.org/jira/browse/SPARK-19697 <https://issues.apache.org/jira/browse/SPARK-19697>
>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96 <https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96>
>> 
>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
>> I have one more issue that, if it needs to be fixed, needs to be fixed for 2.2.0.
>> 
>> I'm fixing build warnings for the release and noticed that checkstyle actually complains there are some Java methods named in TitleCase, like `ProcessingTimeTimeout`:
>> 
>> https://github.com/apache/spark/pull/17803/files#r113934080 <https://github.com/apache/spark/pull/17803/files#r113934080>
>> 
>> Easy enough to fix and it's right, that's not conventional. However I wonder if it was done on purpose to match a class name?
>> 
>> I think this is one for @tdas
>> 
>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
>> Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast.
>> 
>> [ ] +1 Release this package as Apache Spark 2.2.0
>> [ ] -1 Do not release this package because ...
>> 
>> 
>> To learn more about Apache Spark, please see http://spark.apache.org/ <http://spark.apache.org/>
>> 
>> The tag to be voted on is v2.2.0-rc1 <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>> 
>> List of JIRA tickets resolved can be found with this filter <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>> 
>> The release files, including signatures, digests, etc. can be found at:
>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/ <http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/>
>> 
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc <https://people.apache.org/keys/committer/pwendell.asc>
>> 
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1235/ <https://repository.apache.org/content/repositories/orgapachespark-1235/>
>> 
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/ <http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/>
>> 
>> 
>> FAQ
>> 
>> How can I help test this release?
>> 
>> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
>> 
>> What should happen to JIRA tickets still targeting 2.2.0?
>> 
>> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>> 
>> But my bug isn't fixed!??!
>> 
>> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
>> 
>> 
>> 
>> 
>> -- 
>> Ryan Blue
>> Software Engineer
>> Netflix
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Frank,

The issue you're running into is caused by using parquet-avro with Avro
1.7. Can't your downstream project set the Avro dependency to 1.8? Spark
can't update Avro because it is a breaking change that would force users to
rebuilt specific Avro classes in some cases. But you should be free to use
Avro 1.8 to avoid the problem.

On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <fnothaft@berkeley.edu
> wrote:

> Hi Ryan et al,
>
> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
> downstream project is that parquet-avro uses one of the new Avro 1.8.0
> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
> dependency. My colleague Michael (who posted earlier on this thread)
> documented this in Spark-19697
> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark
> has unit tests that check this compatibility issue, but it looks like there
> was a recent change that sets a test scope dependency on Avro 1.8.0
> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
> which masks this issue in the unit tests. With this error, you can’t use
> the ParquetAvroOutputFormat from a application running on Spark 2.2.0.
>
> Regards,
>
> Frank Austin Nothaft
> fnothaft@berkeley.edu
> fnothaft@eecs.berkeley.edu
> 202-340-0466 <(202)%20340-0466>
>
> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID
> <rb...@netflix.com.invalid>> wrote:
>
> I agree with Sean. Spark only pulls in parquet-avro for tests. For
> execution, it implements the record materialization APIs in Parquet to go
> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
> dependency into Spark as far as I can tell.
>
> rb
>
> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> See discussion at https://github.com/apache/spark/pull/17163 -- I think
>> the issue is that fixing this trades one problem for a slightly bigger one.
>>
>>
>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com> wrote:
>>
>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
>>> not bump the dependency version for avro (currently at 1.7.7).  Though
>>> perhaps not clear from the issue I reported [0], this means that Spark is
>>> internally inconsistent, in that a call through parquet (which depends on
>>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>
>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.
>>> 8.2/pom.xml#L96
>>>
>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> I have one more issue that, if it needs to be fixed, needs to be fixed
>>>> for 2.2.0.
>>>>
>>>> I'm fixing build warnings for the release and noticed that checkstyle
>>>> actually complains there are some Java methods named in TitleCase, like
>>>> `ProcessingTimeTimeout`:
>>>>
>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>
>>>> Easy enough to fix and it's right, that's not conventional. However I
>>>> wonder if it was done on purpose to match a class name?
>>>>
>>>> I think this is one for @tdas
>>>>
>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <
>>>> michael@databricks.com> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00
>>>>> PST and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.2.0-rc1
>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c
>>>>> 1a8f8966c7e64010cf5632cb6)
>>>>>
>>>>> List of JIRA tickets resolved can be found with this filter
>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>>> .
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1235/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.
>>>>> 0-rc1-docs/
>>>>>
>>>>>
>>>>> *FAQ*
>>>>>
>>>>> *How can I help test this release?*
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>
>>>>> *But my bug isn't fixed!??!*
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>
>>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Frank Austin Nothaft <fn...@berkeley.edu>.
Hi Ryan et al,

The issue we’ve seen using a build of the Spark 2.2.0 branch from a downstream project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My colleague Michael (who posted earlier on this thread) documented this in Spark-19697 <https://issues.apache.org/jira/browse/SPARK-19697>. I know that Spark has unit tests that check this compatibility issue, but it looks like there was a recent change that sets a test scope dependency on Avro 1.8.0 <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>, which masks this issue in the unit tests. With this error, you can’t use the ParquetAvroOutputFormat from a application running on Spark 2.2.0.

Regards,

Frank Austin Nothaft
fnothaft@berkeley.edu
fnothaft@eecs.berkeley.edu
202-340-0466

> On May 1, 2017, at 10:02 AM, Ryan Blue <rb...@netflix.com.INVALID> wrote:
> 
> I agree with Sean. Spark only pulls in parquet-avro for tests. For execution, it implements the record materialization APIs in Parquet to go directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 dependency into Spark as far as I can tell.
> 
> rb
> 
> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
> See discussion at https://github.com/apache/spark/pull/17163 <https://github.com/apache/spark/pull/17163> -- I think the issue is that fixing this trades one problem for a slightly bigger one.
> 
> 
> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com <ma...@gmail.com>> wrote:
> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does not bump the dependency version for avro (currently at 1.7.7).  Though perhaps not clear from the issue I reported [0], this means that Spark is internally inconsistent, in that a call through parquet (which depends on avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
> 
> [0] - https://issues.apache.org/jira/browse/SPARK-19697 <https://issues.apache.org/jira/browse/SPARK-19697>
> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96 <https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96>
> 
> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com <ma...@cloudera.com>> wrote:
> I have one more issue that, if it needs to be fixed, needs to be fixed for 2.2.0.
> 
> I'm fixing build warnings for the release and noticed that checkstyle actually complains there are some Java methods named in TitleCase, like `ProcessingTimeTimeout`:
> 
> https://github.com/apache/spark/pull/17803/files#r113934080 <https://github.com/apache/spark/pull/17803/files#r113934080>
> 
> Easy enough to fix and it's right, that's not conventional. However I wonder if it was done on purpose to match a class name?
> 
> I think this is one for @tdas
> 
> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <michael@databricks.com <ma...@databricks.com>> wrote:
> Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 2.2.0
> [ ] -1 Do not release this package because ...
> 
> 
> To learn more about Apache Spark, please see http://spark.apache.org/ <http://spark.apache.org/>
> 
> The tag to be voted on is v2.2.0-rc1 <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
> 
> List of JIRA tickets resolved can be found with this filter <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
> 
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/ <http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/>
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc <https://people.apache.org/keys/committer/pwendell.asc>
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1235/ <https://repository.apache.org/content/repositories/orgapachespark-1235/>
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/ <http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/>
> 
> 
> FAQ
> 
> How can I help test this release?
> 
> If you are a Spark user, you can help us test this release by taking an existing Spark workload and running on this release candidate, then reporting any regressions.
> 
> What should happen to JIRA tickets still targeting 2.2.0?
> 
> Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
> 
> But my bug isn't fixed!??!
> 
> In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Apache Spark 2.2.0 (RC1)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I agree with Sean. Spark only pulls in parquet-avro for tests. For
execution, it implements the record materialization APIs in Parquet to go
directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
dependency into Spark as far as I can tell.

rb

On Mon, May 1, 2017 at 8:34 AM, Sean Owen <so...@cloudera.com> wrote:

> See discussion at https://github.com/apache/spark/pull/17163 -- I think
> the issue is that fixing this trades one problem for a slightly bigger one.
>
>
> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <he...@gmail.com> wrote:
>
>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does
>> not bump the dependency version for avro (currently at 1.7.7).  Though
>> perhaps not clear from the issue I reported [0], this means that Spark is
>> internally inconsistent, in that a call through parquet (which depends on
>> avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the
>> classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>
>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>> [1] - https://github.com/apache/parquet-mr/blob/apache-
>> parquet-1.8.2/pom.xml#L96
>>
>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> I have one more issue that, if it needs to be fixed, needs to be fixed
>>> for 2.2.0.
>>>
>>> I'm fixing build warnings for the release and noticed that checkstyle
>>> actually complains there are some Java methods named in TitleCase, like
>>> `ProcessingTimeTimeout`:
>>>
>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>
>>> Easy enough to fix and it's right, that's not conventional. However I
>>> wonder if it was done on purpose to match a class name?
>>>
>>> I think this is one for @tdas
>>>
>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <mi...@databricks.com>
>>> wrote:
>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST
>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v2.2.0-rc1
>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>>
>>>> List of JIRA tickets resolved can be found with this filter
>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>>
>>>>
>>>> *FAQ*
>>>>
>>>> *How can I help test this release?*
>>>>
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload and running it on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>
>>>> Committers should look at those and triage. Extremely important bug
>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>
>>>> *But my bug isn't fixed!??!*
>>>>
>>>> In order to make timely releases, we will typically not hold the
>>>> release unless the bug in question is a regression from 2.1.1.
>>>>
>>>
>>


-- 
Ryan Blue
Software Engineer
Netflix
