Posted to dev@spark.apache.org by Henry Robinson <he...@apache.org> on 2018/04/11 19:35:56 UTC

Maintenance releases for SPARK-23852?

Hi all -

SPARK-23852 (where a query can silently give wrong results thanks to a
predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
I've been involved with, we've released maintenance releases for bugs of
this severity.
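
If anyone needs to protect themselves before a fixed release ships, the
usual workaround is to turn off Parquet predicate pushdown and accept the
scan-time cost. A rough, untested sketch (the path, column and predicate are
placeholders; the wrong results only show up for certain combinations of
file statistics and pushed-down filters):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("spark-23852-check").getOrCreate()

    // With pushdown enabled (the default), an affected build may silently
    // skip row groups and drop matching rows.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    val withPushdown = spark.read.parquet("/path/to/impala_written_table")
      .filter("some_column > 0").count()

    // With pushdown disabled, the same filter is evaluated row by row.
    spark.conf.set("spark.sql.parquet.filterPushdown", "false")
    val withoutPushdown = spark.read.parquet("/path/to/impala_written_table")
      .filter("some_column > 0").count()

    // A mismatch suggests this build and data are affected; leaving pushdown
    // off avoids the wrong results until the dependency is upgraded.
    println(s"pushdown=$withPushdown, no pushdown=$withoutPushdown")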

Since Spark 2.4.0 is probably a while away, I wanted to see if there was
any consensus over whether we should consider (at least) a 2.3.1.

The reason this particular issue is a bit tricky is that the Parquet
community haven't yet produced a maintenance release that fixes the
underlying bug, but they are in the process of releasing a new minor
version, 1.10, which includes a fix. Having spoken to a couple of Parquet
developers, they'd be willing to consider a maintenance release, but would
probably only bother if we (or another affected project) asked them to.

My guess is that we wouldn't want to upgrade to a new minor version of
Parquet for a Spark maintenance release, so asking for a Parquet
maintenance release makes sense.

What does everyone think?

Best,
Henry

Re: Maintenance releases for SPARK-23852?

Posted by Dongjoon Hyun <do...@gmail.com>.
Since the ORC 1.4.3 upgrade needs to be backported from master to
branch-2.3, I made a backport PR.

https://github.com/apache/spark/pull/21093

Thank you for raising this issue and confirming, Henry and Xiao. :)

Bests,
Dongjoon.


On Tue, Apr 17, 2018 at 12:01 AM, Xiao Li <ga...@gmail.com> wrote:

> Yes, it sounds good to me. We can upgrade both Parquet 1.8.2 to 1.8.3 and
> ORC 1.4.1 to 1.4.3 in our upcoming Spark 2.3.1 release.
>
> Thanks for your efforts! @Henry and @Dongjoon
>
> Xiao
>
> 2018-04-16 14:41 GMT-07:00 Henry Robinson <he...@apache.org>:
>
>> Seems like there aren't any objections. I'll pick this thread back up
>> when a Parquet maintenance release has happened.
>>
>> Henry
>>
>> On 11 April 2018 at 14:00, Dongjoon Hyun <do...@gmail.com> wrote:
>>
>>> Great.
>>>
>>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>>
>>> Currently, the patch is only merged into master branch now. 1.4.1 has
>>> the following issue.
>>>
>>> https://issues.apache.org/jira/browse/SPARK-23340
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rx...@databricks.com>
>>> wrote:
>>>
>>>> Seems like this would make sense... we usually make maintenance
>>>> releases for bug fixes after a month anyway.
>>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>>> Spark.
>>>>>>
>>>>>> To be clear though, this only affects Spark when reading data written
>>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>>
>>>>>
>>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>>> compliant with the spec, just an unusual choice).
>>>>>
>>>>>
>>>>>>
>>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all -
>>>>>>>
>>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>>>>>> I've been involved with, we've released maintenance releases for bugs of
>>>>>>> this severity.
>>>>>>>
>>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>>
>>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>>>>>> developers, they'd be willing to consider a maintenance release, but would
>>>>>>> probably only bother if we (or another affected project) asked them to.
>>>>>>>
>>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>>> maintenance release makes sense.
>>>>>>>
>>>>>>> What does everyone think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Henry
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Maintenance releases for SPARK-23852?

Posted by Xiao Li <ga...@gmail.com>.
Yes, it sounds good to me. We can upgrade both Parquet (1.8.2 to 1.8.3) and
ORC (1.4.1 to 1.4.3) in our upcoming Spark 2.3.1 release.
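
Once the bumps land, a quick way to confirm which versions a given build
actually bundles is to check the jar manifests from a spark-shell. A small
sketch (manifest metadata is an assumption and can be missing, in which case
this prints null):

    // The class names below are the usual parquet-hadoop and orc-core entry
    // points; this only reports what the running classpath provides.
    println("parquet-hadoop: " +
      classOf[org.apache.parquet.hadoop.ParquetFileReader]
        .getPackage.getImplementationVersion)
    println("orc-core: " +
      classOf[org.apache.orc.Reader].getPackage.getImplementationVersion)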

Thanks for your efforts! @Henry and @Dongjoon

Xiao

2018-04-16 14:41 GMT-07:00 Henry Robinson <he...@apache.org>:

> Seems like there aren't any objections. I'll pick this thread back up when
> a Parquet maintenance release has happened.
>
> Henry
>
> On 11 April 2018 at 14:00, Dongjoon Hyun <do...@gmail.com> wrote:
>
>> Great.
>>
>> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
>> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>>
>> Currently, the patch is only merged into master branch now. 1.4.1 has the
>> following issue.
>>
>> https://issues.apache.org/jira/browse/SPARK-23340
>>
>> Bests,
>> Dongjoon.
>>
>>
>>
>> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>>
>>> Seems like this would make sense... we usually make maintenance releases
>>> for bug fixes after a month anyway.
>>>
>>>
>>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>>
>>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>>> Spark.
>>>>>
>>>>> To be clear though, this only affects Spark when reading data written
>>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>>
>>>>
>>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>>> seen writing the half-completed stats is Impala. (as you know, that's
>>>> compliant with the spec, just an unusual choice).
>>>>
>>>>
>>>>>
>>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hi all -
>>>>>>
>>>>>> SPARK-23852 (where a query can silently give wrong results thanks to
>>>>>> a predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>>>>> I've been involved with, we've released maintenance releases for bugs of
>>>>>> this severity.
>>>>>>
>>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>>
>>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>>> community haven't yet produced a maintenance release that fixes the
>>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>>>>> developers, they'd be willing to consider a maintenance release, but would
>>>>>> probably only bother if we (or another affected project) asked them to.
>>>>>>
>>>>>> My guess is that we wouldn't want to upgrade to a new minor version
>>>>>> of Parquet for a Spark maintenance release, so asking for a Parquet
>>>>>> maintenance release makes sense.
>>>>>>
>>>>>> What does everyone think?
>>>>>>
>>>>>> Best,
>>>>>> Henry
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Maintenance releases for SPARK-23852?

Posted by Henry Robinson <he...@apache.org>.
Seems like there aren't any objections. I'll pick this thread back up when
a Parquet maintenance release has happened.

Henry

On 11 April 2018 at 14:00, Dongjoon Hyun <do...@gmail.com> wrote:

> Great.
>
> If we can upgrade the parquet dependency from 1.8.2 to 1.8.3 in Apache
> Spark 2.3.1, let's upgrade orc dependency from 1.4.1 to 1.4.3 together.
>
> Currently, the patch is only merged into master branch now. 1.4.1 has the
> following issue.
>
> https://issues.apache.org/jira/browse/SPARK-23340
>
> Bests,
> Dongjoon.
>
>
>
> On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Seems like this would make sense... we usually make maintenance releases
>> for bug fixes after a month anyway.
>>
>>
>> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org>
>> wrote:
>>
>>>
>>>
>>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>>
>>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>>> Spark.
>>>>
>>>> To be clear though, this only affects Spark when reading data written
>>>> by Impala, right? Or does Parquet CPP also produce data like this?
>>>>
>>>
>>> I don't know about parquet-cpp, but yeah, the only implementation I've
>>> seen writing the half-completed stats is Impala. (as you know, that's
>>> compliant with the spec, just an unusual choice).
>>>
>>>
>>>>
>>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi all -
>>>>>
>>>>> SPARK-23852 (where a query can silently give wrong results thanks to a
>>>>> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>>>> I've been involved with, we've released maintenance releases for bugs of
>>>>> this severity.
>>>>>
>>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>>
>>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>>> community haven't yet produced a maintenance release that fixes the
>>>>> underlying bug, but they are in the process of releasing a new minor
>>>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>>>> developers, they'd be willing to consider a maintenance release, but would
>>>>> probably only bother if we (or another affected project) asked them to.
>>>>>
>>>>> My guess is that we wouldn't want to upgrade to a new minor version of
>>>>> Parquet for a Spark maintenance release, so asking for a Parquet
>>>>> maintenance release makes sense.
>>>>>
>>>>> What does everyone think?
>>>>>
>>>>> Best,
>>>>> Henry
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>>
>>
>

Re: Maintenance releases for SPARK-23852?

Posted by Dongjoon Hyun <do...@gmail.com>.
Great.

If we can upgrade the Parquet dependency from 1.8.2 to 1.8.3 in Apache
Spark 2.3.1, let's upgrade the ORC dependency from 1.4.1 to 1.4.3 together.

Currently, the patch is only merged into the master branch. 1.4.1 has the
following issue.

https://issues.apache.org/jira/browse/SPARK-23340

Bests,
Dongjoon.



On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin <rx...@databricks.com> wrote:

> Seems like this would make sense... we usually make maintenance releases
> for bug fixes after a month anyway.
>
>
> On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org> wrote:
>
>>
>>
>> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid> wrote:
>>
>>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>>> Spark.
>>>
>>> To be clear though, this only affects Spark when reading data written by
>>> Impala, right? Or does Parquet CPP also produce data like this?
>>>
>>
>> I don't know about parquet-cpp, but yeah, the only implementation I've
>> seen writing the half-completed stats is Impala. (as you know, that's
>> compliant with the spec, just an unusual choice).
>>
>>
>>>
>>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>>> wrote:
>>>
>>>> Hi all -
>>>>
>>>> SPARK-23852 (where a query can silently give wrong results thanks to a
>>>> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>>> I've been involved with, we've released maintenance releases for bugs of
>>>> this severity.
>>>>
>>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there
>>>> was any consensus over whether we should consider (at least) a 2.3.1.
>>>>
>>>> The reason this particular issue is a bit tricky is that the Parquet
>>>> community haven't yet produced a maintenance release that fixes the
>>>> underlying bug, but they are in the process of releasing a new minor
>>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>>> developers, they'd be willing to consider a maintenance release, but would
>>>> probably only bother if we (or another affected project) asked them to.
>>>>
>>>> My guess is that we wouldn't want to upgrade to a new minor version of
>>>> Parquet for a Spark maintenance release, so asking for a Parquet
>>>> maintenance release makes sense.
>>>>
>>>> What does everyone think?
>>>>
>>>> Best,
>>>> Henry
>>>>
>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>

Re: Maintenance releases for SPARK-23852?

Posted by Reynold Xin <rx...@databricks.com>.
Seems like this would make sense... we usually make maintenance releases
for bug fixes after a month anyway.


On Wed, Apr 11, 2018 at 12:52 PM, Henry Robinson <he...@apache.org> wrote:

>
>
> On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
>> Spark.
>>
>> To be clear though, this only affects Spark when reading data written by
>> Impala, right? Or does Parquet CPP also produce data like this?
>>
>
> I don't know about parquet-cpp, but yeah, the only implementation I've
> seen writing the half-completed stats is Impala. (as you know, that's
> compliant with the spec, just an unusual choice).
>
>
>>
>> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org>
>> wrote:
>>
>>> Hi all -
>>>
>>> SPARK-23852 (where a query can silently give wrong results thanks to a
>>> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>>> I've been involved with, we've released maintenance releases for bugs of
>>> this severity.
>>>
>>> Since Spark 2.4.0 is probably a while away, I wanted to see if there was
>>> any consensus over whether we should consider (at least) a 2.3.1.
>>>
>>> The reason this particular issue is a bit tricky is that the Parquet
>>> community haven't yet produced a maintenance release that fixes the
>>> underlying bug, but they are in the process of releasing a new minor
>>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>>> developers, they'd be willing to consider a maintenance release, but would
>>> probably only bother if we (or another affected project) asked them to.
>>>
>>> My guess is that we wouldn't want to upgrade to a new minor version of
>>> Parquet for a Spark maintenance release, so asking for a Parquet
>>> maintenance release makes sense.
>>>
>>> What does everyone think?
>>>
>>> Best,
>>> Henry
>>>
>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

Re: Maintenance releases for SPARK-23852?

Posted by Henry Robinson <he...@apache.org>.
On 11 April 2018 at 12:47, Ryan Blue <rb...@netflix.com.invalid> wrote:

> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
> Spark.
>
> To be clear though, this only affects Spark when reading data written by
> Impala, right? Or does Parquet CPP also produce data like this?
>

I don't know about parquet-cpp, but yeah, the only implementation I've seen
writing the half-completed stats is Impala. (As you know, that's compliant
with the spec, just an unusual choice.)
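
If it helps anyone check their own files, here's a rough sketch of dumping
the per-row-group column statistics with parquet-mr, so you can see whether
a file records a null count without min/max. The API names are from
parquet-mr 1.8.x as I remember them, so double-check before relying on this;
the path is a placeholder:

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Read only the footer and print, for every row group and column, the
    // recorded null count and whether a min/max pair was written.
    val footer = ParquetFileReader.readFooter(
      new Configuration(), new Path("/path/to/file.parquet"))
    for {
      block <- footer.getBlocks.asScala
      col   <- block.getColumns.asScala
    } {
      Option(col.getStatistics).foreach { stats =>
        println(s"${col.getPath}: numNulls=${stats.getNumNulls}, " +
          s"hasMinMax=${stats.hasNonNullValue}")
      }
    }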


>
> On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org> wrote:
>
>> Hi all -
>>
>> SPARK-23852 (where a query can silently give wrong results thanks to a
>> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
>> I've been involved with, we've released maintenance releases for bugs of
>> this severity.
>>
>> Since Spark 2.4.0 is probably a while away, I wanted to see if there was
>> any consensus over whether we should consider (at least) a 2.3.1.
>>
>> The reason this particular issue is a bit tricky is that the Parquet
>> community haven't yet produced a maintenance release that fixes the
>> underlying bug, but they are in the process of releasing a new minor
>> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
>> developers, they'd be willing to consider a maintenance release, but would
>> probably only bother if we (or another affected project) asked them to.
>>
>> My guess is that we wouldn't want to upgrade to a new minor version of
>> Parquet for a Spark maintenance release, so asking for a Parquet
>> maintenance release makes sense.
>>
>> What does everyone think?
>>
>> Best,
>> Henry
>>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Maintenance releases for SPARK-23852?

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of Spark.

To be clear though, this only affects Spark when reading data written by
Impala, right? Or does Parquet CPP also produce data like this?

On Wed, Apr 11, 2018 at 12:35 PM, Henry Robinson <he...@apache.org> wrote:

> Hi all -
>
> SPARK-23852 (where a query can silently give wrong results thanks to a
> predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
> I've been involved with, we've released maintenance releases for bugs of
> this severity.
>
> Since Spark 2.4.0 is probably a while away, I wanted to see if there was
> any consensus over whether we should consider (at least) a 2.3.1.
>
> The reason this particular issue is a bit tricky is that the Parquet
> community haven't yet produced a maintenance release that fixes the
> underlying bug, but they are in the process of releasing a new minor
> version, 1.10, which includes a fix. Having spoken to a couple of Parquet
> developers, they'd be willing to consider a maintenance release, but would
> probably only bother if we (or another affected project) asked them to.
>
> My guess is that we wouldn't want to upgrade to a new minor version of
> Parquet for a Spark maintenance release, so asking for a Parquet
> maintenance release makes sense.
>
> What does everyone think?
>
> Best,
> Henry
>



-- 
Ryan Blue
Software Engineer
Netflix