Posted to dev@parquet.apache.org by "colin fang (JIRA)" <ji...@apache.org> on 2019/03/18 18:30:00 UTC
[jira] [Updated] (PARQUET-1547) Detect parquet-mr style dictionary_page
[ https://issues.apache.org/jira/browse/PARQUET-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
colin fang updated PARQUET-1547:
--------------------------------
Description:
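parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset): the dictionary page's offset ends up in the data page offset field, and the dictionary page offset field is left at 0.
As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports `has_dictionary_page: False` and `dictionary_page_offset: None`, as in the metadata dumps below: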
{code}
row group 0
--------------------------------------------------------------------------------
x: DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y: BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]
x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000
{code}
{code}
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
file_offset: 4
file_path:
physical_type: DOUBLE
num_values: 70000
path_in_schema: x
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
has_min_max: True
min: 1.0
max: 5.0
null_count: 10000
distinct_count: 0
num_values: 60000
physical_type: DOUBLE
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
has_dictionary_page: False
dictionary_page_offset: None
data_page_offset: 4
total_compressed_size: 1632
total_uncompressed_size: 31635
{code}
Is parquet-cpp still able to use the dictionary in this case?
It would be nice if parquet-cpp could recognize the parquet-mr issue and set `has_dictionary_page` to True.
https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/
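To make the request concrete, here is a rough sketch of the kind of detection I have in mind. It is plain Python over the fields pyarrow already prints above, not parquet-cpp code. The function name `infer_dictionary_page_offset` and the heuristic itself are my own assumptions: if the chunk lists a dictionary encoding but has no dictionary page offset, treat `data_page_offset` as the location of the dictionary page.

```python
from typing import Iterable, Optional

# Encoding names that imply a dictionary page exists in the column chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def infer_dictionary_page_offset(
    encodings: Iterable[str],
    dictionary_page_offset: Optional[int],
    data_page_offset: int,
) -> Optional[int]:
    """Hypothetical helper: guess where the dictionary page starts.

    Assumption (not confirmed parquet-cpp behavior): a chunk that uses a
    dictionary encoding but reports no dictionary offset (None or 0) was
    written parquet-mr style, so data_page_offset points at the dictionary.
    """
    uses_dictionary = any(e in DICTIONARY_ENCODINGS for e in encodings)
    offset_missing = dictionary_page_offset is None or dictionary_page_offset == 0
    if uses_dictionary and offset_missing:
        return data_page_offset
    # Either the offset is already set, or there is no dictionary at all.
    return dictionary_page_offset if uses_dictionary else None


# Values taken from the ColumnChunkMetaData dump for column x above.
inferred = infer_dictionary_page_offset(
    ("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"), None, 4
)
print(inferred)
```

With the values for column x this would report the dictionary page at offset 4, and `has_dictionary_page` could then be derived as `inferred is not None`. A chunk that falls back from dictionary to plain encoding partway through would still list a dictionary encoding, so the check only says that a dictionary page is likely present, not that every data page uses it.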
> Detect parquet-mr style dictionary_page
> ---------------------------------------
>
> Key: PARQUET-1547
> URL: https://issues.apache.org/jira/browse/PARQUET-1547
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: colin fang
> Priority: Minor
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)