Posted to dev@parquet.apache.org by Uwe Korn <uw...@xhochy.com> on 2016/06/16 19:57:01 UTC

List of Additions to Parquet 2

Hello,

I'm currently looking at the differences between Parquet 1 and Parquet 2 
to implement these versions as a switch in parquet-cpp. The only list I 
could find is the changelog [1], which is short on detail. Is there a 
better list, or do I need to go through the referenced changesets 
myself to find the actual differences? (If the latter is the case, I'd 
also make a PR afterwards that augments the documentation with some 
"(since version 2.0)" markings.) But I'm hoping that there is a blog 
post or similar out there that could make my life easier.

Thanks,

Uwe

[1] https://github.com/apache/parquet-format/blob/master/CHANGES.md

Re: List of Additions to Parquet 2

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

Good point about needing more documentation on the 2.0 spec. Right now,
it's documented mostly in the code and in a table in the README [1]. But
even that table is unclear because many of the additions we've made are
forward-compatible and can be used in 1.0 files.

For example, the stats that were added are compatible with the 1.0 format
because older readers will ignore the stats objects when reading. The
"ConvertedType" annotations, like INT_8 or UINT_16, are similar. Thrift will
ignore unknown enum values and the field is optional, so a UINT_16 looks
like an un-annotated INT32 to older readers. The addition or use of new
logical type annotations is compatible with 1.0 and implementing read-side
support should always be considered compatible with the format (though not
necessarily with the API).
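
As a rough illustration of the fallback described above, here is a minimal
Python sketch. It is not parquet-mr or parquet-cpp code: resolve_type and
the two "known annotation" sets are hypothetical names made up for this
example, and the sets are illustrative rather than the real enum contents.

```python
# Hypothetical model of how a reader picks the type it exposes for a column.
# The sets below are illustrative, not the real parquet-format enums.
OLD_READER_KNOWN = {"UTF8", "DECIMAL"}
NEW_READER_KNOWN = OLD_READER_KNOWN | {"INT_8", "UINT_16"}


def resolve_type(physical_type, converted_type, known_annotations):
    """Return the type a reader exposes for a column.

    An annotation the reader does not recognise is dropped, so the column
    falls back to its physical type instead of failing.
    """
    if converted_type in known_annotations:
        return converted_type
    return physical_type


column = ("INT32", "UINT_16")  # physical type, optional annotation
old_view = resolve_type(*column, OLD_READER_KNOWN)
new_view = resolve_type(*column, NEW_READER_KNOWN)
print(old_view, new_view)
```

The older reader still gets usable INT32 values; it only loses the hint
that they were meant to be unsigned 16-bit integers.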

The only features that aren't 1.0 compatible are those that cause a file to
be unreadable by existing 1.0 readers, like the new delta page encodings
and the addition of BROTLI to the compression enum [2].
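
For a version switch like the one Uwe wants in parquet-cpp, one possible
shape is a check for the features that would make a file unreadable to 1.0
readers. The sketch below is only an assumption about how that could look:
the encoding and codec names follow the parquet-format enums, but the
grouping into "2.0-only" sets and the function name are hypothetical.

```python
# Features that, per the message above, make a file unreadable to 1.0 readers.
# The grouping into these two sets is an assumption of this sketch.
V2_ONLY_ENCODINGS = {
    "DELTA_BINARY_PACKED",
    "DELTA_LENGTH_BYTE_ARRAY",
    "DELTA_BYTE_ARRAY",
}
V2_ONLY_CODECS = {"BROTLI"}


def features_breaking_v1_readers(encodings, codec):
    """List the requested features that a 1.0 reader could not decode."""
    broken = sorted(set(encodings) & V2_ONLY_ENCODINGS)
    if codec in V2_ONLY_CODECS:
        broken.append(codec)
    return broken


# A writer in "1.0" mode could refuse any non-empty result.
print(features_breaking_v1_readers(["PLAIN", "RLE"], "SNAPPY"))
print(features_breaking_v1_readers(["PLAIN", "DELTA_BYTE_ARRAY"], "BROTLI"))
```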

rb

[1]: https://github.com/apache/parquet-mr#features
[2]: https://github.com/apache/parquet-format/pull/40

On Thu, Jun 16, 2016 at 1:19 PM, Wes McKinney <we...@gmail.com> wrote:

> To add a bit of context, we're looking at the handling of integers
> other than INT32 and INT64 from the perspective of Apache Arrow. It
> seems that in Parquet 1 files, you may not be able to recover the
> original integer types from the file alone. The question is, should we
> put this metadata in the Parquet file? See
>
> https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67
>
> If it may cause problems, we can leave the physical storage type as is
> and leave users to explicitly cast on deserialization to another
> integer type.
>
> Thanks,
> Wes

-- 
Ryan Blue
Software Engineer
Netflix

Re: List of Additions to Parquet 2

Posted by Ryan Blue <rb...@netflix.com.INVALID>.

I don't think it's a good idea to use the metadata summary files or to
merge footer information from multiple Parquet files. As you note, it's not
possible to merge conflicting user-defined metadata, merging schemas causes
significant problems (and is usually unnecessary), and keeping the summary
files up to date is a manual step that is often overlooked. The right thing
to do is to reconcile the expected metadata with each file's metadata
individually when that file is read. That distributes the work and avoids
bottlenecks like those addressed in PARQUET-139 [1].

The only reason to merge file metadata is to infer an overall schema from a
set of data files, which is not usually necessary because the schema for a
table is tracked by the metastore. If you're not storing the table schema,
then it's much better either to use no schema and return null when columns
are missing, or to require an expected schema from the reader (e.g., from
Avro specific or thrift classes).
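
To make the per-file reconciliation concrete, here is a small Python sketch.
It models a file as a plain dict of column name to values, which is an
assumption for illustration only; read_with_expected_schema is a
hypothetical helper, not an existing Parquet API.

```python
def read_with_expected_schema(file_columns, expected_columns):
    """Project one file's columns onto the columns the reader expects.

    file_columns: dict mapping column name -> list of values for one file.
    expected_columns: ordered list of column names the reader wants.
    Columns the file lacks are filled with None (null) values.
    """
    num_rows = len(next(iter(file_columns.values()), []))
    return {
        name: file_columns.get(name, [None] * num_rows)
        for name in expected_columns
    }


old_file = {"id": [1, 2]}                # written before "name" existed
new_file = {"id": [3], "name": ["c"]}    # written after the schema evolved
expected = ["id", "name"]

# Each file is reconciled on its own; no footers are merged.
for data_file in (old_file, new_file):
    print(read_with_expected_schema(data_file, expected))
```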

rb


[1]: https://issues.apache.org/jira/browse/PARQUET-139

On Thu, Jun 16, 2016 at 2:11 PM, Cheng Lian <li...@gmail.com> wrote:

> One problem with Parquet user-defined key/value metadata is that, when
> merging footers of multiple Parquet files to generate the summary files, if
> two Parquet files have key/value entries with the same key but different
> values, Parquet doesn't know how to merge them, and simply throws an
> exception and gives up writing the summary file. If you're appending new
> data into an existing directory with old summary files, you may end up with
> stale summary files since the old ones are not properly overwritten.
>
> This can be a problem in the case of schema evolution. For example, Spark
> SQL writes JSON-ized schema strings to Parquet files as key/value metadata.
> When appending new Parquet files into an existing directory containing
> existing files with different but compatible schemata, summary files can't
> be properly generated.
>
> But in practice this isn't a big problem since Parquet summary files are
> not that important nowadays.
>
>
> Cheng



Re: List of Additions to Parquet 2

Posted by Cheng Lian <li...@gmail.com>.

One problem with Parquet user-defined key/value metadata is that, when 
merging footers of multiple Parquet files to generate the summary files, 
if two Parquet files have key/value entries with the same key but 
different values, Parquet doesn't know how to merge them, and simply 
throws an exception and gives up writing the summary file. If you're 
appending new data into an existing directory with old summary files, 
you may end up with stale summary files since the old ones are not 
properly overwritten.
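
The failure mode can be sketched in a few lines of Python. This only models
the behaviour described above; merge_key_value_metadata is a hypothetical
function for illustration, not the actual parquet-mr code.

```python
def merge_key_value_metadata(footers):
    """Merge per-file key/value metadata, failing on conflicting values.

    footers: iterable of dicts, one per Parquet file.
    """
    merged = {}
    for footer in footers:
        for key, value in footer.items():
            if key in merged and merged[key] != value:
                # No merge rule exists for differing values, so give up.
                raise ValueError(f"conflicting values for key {key!r}")
            merged[key] = value
    return merged


# Identical values merge cleanly.
print(merge_key_value_metadata([{"writer": "a"}, {"writer": "a", "x": "1"}]))

# Two files carrying different schema strings under the same key do not.
try:
    merge_key_value_metadata([{"schema": "v1"}, {"schema": "v2"}])
except ValueError as err:
    print("summary file not written:", err)
```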

This can be a problem in the case of schema evolution. For example, 
Spark SQL writes JSON-ized schema strings to Parquet files as key/value 
metadata. When appending new Parquet files into an existing directory 
containing existing files with different but compatible schemata, 
summary files can't be properly generated.

But in practice this isn't a big problem since Parquet summary files are 
not that important nowadays.

Cheng

On 6/16/16 1:19 PM, Wes McKinney wrote:
> To add a bit of context, we're looking at the handling of integers
> other than INT32 and INT64 from the perspective of Apache Arrow. It
> seems that in Parquet 1 files, you may not be able to recover the
> original integer types from the file alone. The question is, should we
> put this metadata in the Parquet file? See
> https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67
>
> If it may cause problems, we can leave the physical storage type as is
> and leave users to explicitly cast on deserialization to another
> integer type.
>
> Thanks,
> Wes

Re: List of Additions to Parquet 2

Posted by Wes McKinney <we...@gmail.com>.

To add a bit of context, we're looking at the handling of integers
other than INT32 and INT64 from the perspective of Apache Arrow. It
seems that in Parquet 1 files, you may not be able to recover the
original integer types from the file alone. The question is, should we
put this metadata in the Parquet file? See
https://github.com/apache/arrow/pull/89/files#diff-147a93dad8a2dfdac5531007c5c686b1R67

If it may cause problems, we can leave the physical storage type as is
and leave users to explicitly cast on deserialization to another
integer type.
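
A sketch of what the explicit-cast fallback could look like on the reading
side, in plain Python. The helper name and the type strings are
hypothetical, and the uint32 branch assumes the unsigned value was stored
as the same 32 bits in a signed INT32 column.

```python
# Inclusive value ranges for the narrower integer types.
_RANGES = {
    "int8": (-128, 127),
    "uint8": (0, 255),
    "int16": (-32768, 32767),
    "uint16": (0, 65535),
}


def cast_from_int32(values, target):
    """Cast values read as INT32 to the integer type the user asks for."""
    if target == "uint32":
        # Assumption: same 32 bits, reinterpreted as unsigned.
        return [v & 0xFFFFFFFF for v in values]
    low, high = _RANGES[target]
    for v in values:
        if not low <= v <= high:
            raise OverflowError(f"{v} does not fit in {target}")
    return list(values)


print(cast_from_int32([0, 65535], "uint16"))
print(cast_from_int32([-1], "uint32"))
```

Without an annotation in the file, the reader cannot choose the target
type itself, so the caller has to supply it.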

Thanks,
Wes

On Thu, Jun 16, 2016 at 12:57 PM, Uwe Korn <uw...@xhochy.com> wrote:
> Hello,
>
> I'm currently looking at the differences between Parquet 1 and Parquet 2 to
> implement these versions as a switch in parquet-cpp. The only list I could
> find is the changelog [1], which is short on detail. Is there a better list,
> or do I need to go through the referenced changesets myself to find
> the actual differences? (If the latter is the case, I'd also make a PR
> afterwards that augments the documentation with some "(since version 2.0)"
> markings.) But I'm hoping that there is a blog post or similar out there
> that could make my life easier.
>
> Thanks,
>
> Uwe
>
> [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
>