Posted to dev@spark.apache.org by Michael Allman <mi...@videoamp.com> on 2016/10/31 21:07:25 UTC

Updating Parquet dep to 1.9

Hi All,

Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I can at least get started on it and publish a PR.

Cheers,

Michael
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Updating Parquet dep to 1.9

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
The stats problem is on the write side. Parquet orders byte buffers (which
are also used for UTF8 strings) with a byte-wise comparison, but gets it
wrong: it compares the Java byte values, which are signed. UTF8 ordering is
the same as byte-wise comparison only if the bytes are compared as unsigned
values. So Parquet ends up with the wrong min and max whenever the data
contains characters whose sign bit / msb is set. For ASCII the results are
identical, but other character sets, like latin1, end up with accented
characters out of order.
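
To make the difference concrete, here is a small, self-contained Java
sketch (not Parquet's actual comparator) contrasting the two orderings
for "e" versus "é":

    import java.nio.charset.StandardCharsets;

    public class SignedVsUnsignedOrder {
      // Lexicographic comparison treating each byte as signed (the buggy order).
      static int compareSigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
          if (a[i] != b[i]) return Byte.compare(a[i], b[i]);   // 0xC3 reads as -61
        }
        return Integer.compare(a.length, b.length);
      }

      // Lexicographic comparison treating each byte as unsigned (what UTF8 needs).
      static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
          if (a[i] != b[i]) return Integer.compare(a[i] & 0xFF, b[i] & 0xFF);  // 0xC3 reads as 195
        }
        return Integer.compare(a.length, b.length);
      }

      public static void main(String[] args) {
        byte[] e = "e".getBytes(StandardCharsets.UTF_8);            // 0x65
        byte[] eAcute = "\u00e9".getBytes(StandardCharsets.UTF_8);  // 0xC3 0xA9
        System.out.println(compareSigned(eAcute, e));    // < 0: 'é' wrongly sorts before 'e'
        System.out.println(compareUnsigned(eAcute, e));  // > 0: the correct UTF8 ordering
      }
    }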

To fix the correctness bug in applications like SparkSQL, Parquet 1.9.0
suppresses the min and max values when the sort order that produced them is
incorrect. There is a property to override this if you know your data
contains only ASCII characters, but by default min and max are not
considered reliable and are not used to eliminate row groups with predicate
push-down. Other types aren't affected, and their row group filters will
still work.
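
If you do know the data is ASCII-only, the override is a read-side
configuration switch, roughly along the lines of the sketch below. The key
name here is from memory rather than from anything in this thread, so treat
it as a placeholder and confirm it against the 1.9.0 sources:

    import org.apache.hadoop.conf.Configuration;

    public class TrustAsciiStringStats {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Placeholder key: declares that string min/max written with the old
        // signed order are still usable because the data is ASCII-only.
        conf.setBoolean("parquet.strings.signed-min-max.enabled", true);
      }
    }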

1.9.0 also adds dictionary filtering to predicate push-down, which can be
used in many cases to skip row groups as well. Dictionary filtering doesn't
rely on the min and max values, so it will still work for strings.
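
To make the mechanics concrete: dictionary filtering evaluates the same
FilterPredicate that stats-based pruning uses, so nothing changes on the
query side. A rough sketch of attaching a predicate to a read job (the
column name and value are made up for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.filter2.predicate.FilterApi;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetInputFormat;
    import org.apache.parquet.io.api.Binary;

    public class PushDownExample {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // country = 'US': with 1.9.0 this can be checked against each row
        // group's dictionary page as well as its (possibly suppressed) stats.
        FilterPredicate onlyUs = FilterApi.eq(
            FilterApi.binaryColumn("country"),
            Binary.fromString("US"));
        ParquetInputFormat.setFilterPredicate(conf, onlyUs);
      }
    }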

The issue for the stats ordering bug is PARQUET-686. Writes will be fixed
in 1.9.1, which I'd like to have out in the next couple of weeks.

My overall recommendation is to do the update to 1.9.0, which fixes the
logging problem, too.

rb

On Wed, Nov 2, 2016 at 8:31 AM, Michael Allman <mi...@videoamp.com> wrote:

> Sounds great. Regarding the min/max stats issue, is that an issue with the
> way the files are written or read? What's the Parquet project issue for
> that bug? What's the 1.9.1 release timeline look like?
>
> I will aim to have a PR in by the end of the week. I feel strongly that
> either this or https://github.com/apache/spark/pull/15538 needs to make
> it into 2.1. The logging output issue is really bad. I would probably call
> it a blocker.
>
> Michael
>
>
> On Nov 1, 2016, at 1:22 PM, Ryan Blue <rb...@netflix.com> wrote:
>
> I can when I'm finished with a couple other issues if no one gets to it
> first.
>
> Michael, if you're interested in updating to 1.9.0 I'm happy to help
> review that PR.
>
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:
>
>> Ryan, want to submit a pull request?
>>
>>
>> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid>
>> wrote:
>>
>>> 1.9.0 includes some fixes intended specifically for Spark:
>>>
>>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>>> though they are null. This is to address Spark's work-around that requires
>>> reading and merging file schemas, even for metastore tables.
>>> * PARQUET-654: Adds an option to disable record-level predicate
>>> push-down, but keep row group evaluation. This allows Spark to skip row
>>> groups based on stats and dictionaries, but implement its own vectorized
>>> record filtering.
>>>
>>> The Parquet community also evaluated performance to ensure no
>>> performance regressions from moving to the ByteBuffer read path.
>>>
>>> There is one concern about 1.9.0 that will be addressed in 1.9.1, which
>> is that stats calculations were incorrectly using signed byte order for
>>> string comparison. This means that min/max stats can't be used if the data
>>> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
>>> return the bad min/max values for correctness, but there is a property to
>>> override this behavior for data that doesn't use the affected code points.
>>>
>>> Upgrading to 1.9.0 depends on how the community wants to handle the sort
>>> order bug: whether correctness or performance should be the default.
>>>
>>> rb
>>>
>>> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
>>>>
>>>> I think it's fine to pursue an upgrade to fix these several issues. The
>>>> question is just how well it will play with other components, so it bears some
>>>> testing and evaluation of the changes from 1.8, but yes this would be good.
>>>>
>>>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mi...@videoamp.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>>>> not, I can at least get started on it and publish a PR.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Michael
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Updating Parquet dep to 1.9

Posted by Michael Allman <mi...@videoamp.com>.
Sounds great. Regarding the min/max stats issue, is that an issue with the way the files are written or read? What's the Parquet project issue for that bug? What's the 1.9.1 release timeline look like?

I will aim to have a PR in by the end of the week. I feel strongly that either this or https://github.com/apache/spark/pull/15538 needs to make it into 2.1. The logging output issue is really bad. I would probably call it a blocker.

Michael


> On Nov 1, 2016, at 1:22 PM, Ryan Blue <rb...@netflix.com> wrote:
> 
> I can when I'm finished with a couple other issues if no one gets to it first.
> 
> Michael, if you're interested in updating to 1.9.0 I'm happy to help review that PR.
> 
> On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <rxin@databricks.com> wrote:
> Ryan, want to submit a pull request?
> 
> 
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rblue@netflix.com.invalid> wrote:
> 1.9.0 includes some fixes intended specifically for Spark:
> 
> * PARQUET-389: Evaluates push-down predicates for missing columns as though they are null. This is to address Spark's work-around that requires reading and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down, but keep row group evaluation. This allows Spark to skip row groups based on stats and dictionaries, but implement its own vectorized record filtering.
> 
> The Parquet community also evaluated performance to ensure no performance regressions from moving to the ByteBuffer read path.
> 
> There is one concern about 1.9.0 that will be addressed in 1.9.1, which is that stats calculations were incorrectly using signed byte order for string comparison. This means that min/max stats can't be used if the data contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't return the bad min/max values for correctness, but there is a property to override this behavior for data that doesn't use the affected code points.
> 
> Upgrading to 1.9.0 depends on how the community wants to handle the sort order bug: whether correctness or performance should be the default.
> 
> rb
> 
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <sowen@cloudera.com> wrote:
> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
> 
> I think it's fine to pursue an upgrade to fix these several issues. The question is just how well it will play with other components, so it bears some testing and evaluation of the changes from 1.8, but yes this would be good.
> 
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <michael@videoamp.com> wrote:
> Hi All,
> 
> Is anyone working on updating Spark's Parquet library dep to 1.9? If not, I can at least get started on it and publish a PR.
> 
> Cheers,
> 
> Michael
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix
> 
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: Updating Parquet dep to 1.9

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
I can when I'm finished with a couple other issues if no one gets to it
first.

Michael, if you're interested in updating to 1.9.0 I'm happy to help review
that PR.

On Tue, Nov 1, 2016 at 1:03 PM, Reynold Xin <rx...@databricks.com> wrote:

> Ryan, want to submit a pull request?
>
>
> On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> 1.9.0 includes some fixes intended specifically for Spark:
>>
>> * PARQUET-389: Evaluates push-down predicates for missing columns as
>> though they are null. This is to address Spark's work-around that requires
>> reading and merging file schemas, even for metastore tables.
>> * PARQUET-654: Adds an option to disable record-level predicate
>> push-down, but keep row group evaluation. This allows Spark to skip row
>> groups based on stats and dictionaries, but implement its own vectorized
>> record filtering.
>>
>> The Parquet community also evaluated performance to ensure no performance
>> regressions from moving to the ByteBuffer read path.
>>
>> There is one concern about 1.9.0 that will be addressed in 1.9.1, which
>> is that stats calculations were incorrectly using signed byte order for
>> string comparison. This means that min/max stats can't be used if the data
>> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
>> return the bad min/max values for correctness, but there is a property to
>> override this behavior for data that doesn't use the affected code points.
>>
>> Upgrading to 1.9.0 depends on how the community wants to handle the sort
>> order bug: whether correctness or performance should be the default.
>>
>> rb
>>
>> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
>>>
>>> I think it's fine to pursue an upgrade to fix these several issues. The
>>> question is just how well it will play with other components, so it bears some
>>> testing and evaluation of the changes from 1.8, but yes this would be good.
>>>
>>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mi...@videoamp.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>>> not, I can at least get started on it and publish a PR.
>>>>
>>>> Cheers,
>>>>
>>>> Michael
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>>
>>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Updating Parquet dep to 1.9

Posted by Reynold Xin <rx...@databricks.com>.
Ryan, want to submit a pull request?


On Tue, Nov 1, 2016 at 9:05 AM, Ryan Blue <rb...@netflix.com.invalid> wrote:

> 1.9.0 includes some fixes intended specifically for Spark:
>
> * PARQUET-389: Evaluates push-down predicates for missing columns as
> though they are null. This is to address Spark's work-around that requires
> reading and merging file schemas, even for metastore tables.
> * PARQUET-654: Adds an option to disable record-level predicate push-down,
> but keep row group evaluation. This allows Spark to skip row groups based
> on stats and dictionaries, but implement its own vectorized record
> filtering.
>
> The Parquet community also evaluated performance to ensure no performance
> regressions from moving to the ByteBuffer read path.
>
> There is one concern about 1.9.0 that will be addressed in 1.9.1, which is
> that stats calculations were incorrectly using signed byte order for
> string comparison. This means that min/max stats can't be used if the data
> contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
> return the bad min/max values for correctness, but there is a property to
> override this behavior for data that doesn't use the affected code points.
>
> Upgrading to 1.9.0 depends on how the community wants to handle the sort
> order bug: whether correctness or performance should be the default.
>
> rb
>
> On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:
>
>> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
>>
>> I think it's fine to pursue an upgrade to fix these several issues. The
>> question is just how well it will play with other components, so it bears some
>> testing and evaluation of the changes from 1.8, but yes this would be good.
>>
>> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mi...@videoamp.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> Is anyone working on updating Spark's Parquet library dep to 1.9? If
>>> not, I can at least get started on it and publish a PR.
>>>
>>> Cheers,
>>>
>>> Michael
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

Re: Updating Parquet dep to 1.9

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
1.9.0 includes some fixes intended specifically for Spark:

* PARQUET-389: Evaluates push-down predicates for missing columns as though
they are null. This is to address Spark's work-around that requires reading
and merging file schemas, even for metastore tables.
* PARQUET-654: Adds an option to disable record-level predicate push-down,
but keep row group evaluation. This allows Spark to skip row groups based
on stats and dictionaries, but implement its own vectorized record
filtering.
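
For the second item, the idea on the reader side amounts to a pair of
configuration switches: leave row group evaluation on, turn record-level
filtering off, and apply the same predicate in the engine's own vectorized
reader. A minimal sketch; the property keys are written from memory, so
check them against the 1.9.0 ParquetInputFormat constants before relying
on them:

    import org.apache.hadoop.conf.Configuration;

    public class RowGroupOnlyFiltering {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Keep row group elimination from stats and dictionary pages on ...
        conf.setBoolean("parquet.filter.dictionary.enabled", true);
        // ... but skip Parquet's record-by-record filtering; the engine runs
        // the same predicate in its own vectorized reader instead.
        conf.setBoolean("parquet.filter.record-level.enabled", false);
      }
    }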

The Parquet community also evaluated performance to ensure no performance
regressions from moving to the ByteBuffer read path.

There is one concern about 1.9.0 that will be addressed in 1.9.1, which is
that stats calculations were incorrectly using signed byte order for
string comparison. This means that min/max stats can't be used if the data
contains (or may contain) UTF8 characters with the msb set. 1.9.0 won't
return the bad min/max values for correctness, but there is a property to
override this behavior for data that doesn't use the affected code points.

Upgrading to 1.9.0 depends on how the community wants to handle the sort
order bug: whether correctness or performance should be the default.

rb

On Tue, Nov 1, 2016 at 2:22 AM, Sean Owen <so...@cloudera.com> wrote:

> Yes this came up from a different direction: https://issues.apache.org/jira/browse/SPARK-18140
>
> I think it's fine to pursue an upgrade to fix these several issues. The
> question is just how well it will play with other components, so it bears some
> testing and evaluation of the changes from 1.8, but yes this would be good.
>
> On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mi...@videoamp.com>
> wrote:
>
>> Hi All,
>>
>> Is anyone working on updating Spark's Parquet library dep to 1.9? If not,
>> I can at least get started on it and publish a PR.
>>
>> Cheers,
>>
>> Michael
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>
>>


-- 
Ryan Blue
Software Engineer
Netflix

Re: Updating Parquet dep to 1.9

Posted by Sean Owen <so...@cloudera.com>.
Yes this came up from a different direction:
https://issues.apache.org/jira/browse/SPARK-18140

I think it's fine to pursue an upgrade to fix these several issues. The
question is just how well it will play with other components, so it bears some
testing and evaluation of the changes from 1.8, but yes this would be good.

On Mon, Oct 31, 2016 at 9:07 PM Michael Allman <mi...@videoamp.com> wrote:

> Hi All,
>
> Is anyone working on updating Spark's Parquet library dep to 1.9? If not,
> I can at least get started on it and publish a PR.
>
> Cheers,
>
> Michael
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>
>