Posted to dev@parquet.apache.org by "colin fang (JIRA)" <ji...@apache.org> on 2019/03/18 18:30:00 UTC

[jira] [Updated] (PARQUET-1547) Detect parquet-mr style dictionary_page

     [ https://issues.apache.org/jira/browse/PARQUET-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

colin fang updated PARQUET-1547:
--------------------------------
    Description: 
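parquet-mr incorrectly writes the pair (dictionary_page_offset, first_data_page_offset) as (0, dictionary_page_offset): the dictionary page offset is written as 0, and the first data page offset holds the offset of the dictionary page.

As a result, whenever parquet-cpp (pyarrow) reads such a file, it reports `has_dictionary_page: False` and `dictionary_page_offset: None`, as the two dumps below show.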

{code}
row group 0 
--------------------------------------------------------------------------------
x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000]
y:  BINARY SNAPPY DO:0 FPO:1636 SZ:268/3885/14.50 VC:70000 ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY ST:[min: a, max: é, num_nulls: 10000]

    x TV=70000 RL=0 DL=1 DS: 5 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                   DLE:RLE RLE:BIT_PACKED VLE:PLAIN_DICTIONARY ST:[min: 1.0, max: 5.0, num_nulls: 10000] SZ:31514 VC:70000

{code}
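
To make the symptom easier to spot in the dump above, here is a minimal sketch that parses one column line and flags the case where DO is 0 even though a dictionary encoding is listed. It assumes DO and FPO are the dictionary page offset and first data page offset, as described at the top of this issue. The regular expression, the function name `parse_column_line` and the key `suspicious_dictionary_offset` are made up for this illustration and are not part of parquet-tools.

```python
import re

# Matches one column line of the dump above, for example:
# "x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 ENC:RLE,..."
# (hypothetical pattern written for this illustration only)
COLUMN_LINE = re.compile(
    r"^(?P<name>\S+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+"
    r"DO:(?P<do>\d+)\s+FPO:(?P<fpo>\d+)\s+"
    r"SZ:(?P<compressed>\d+)/(?P<uncompressed>\d+)/\S+\s+"
    r"VC:(?P<vc>\d+)\s+ENC:(?P<enc>\S+)"
)

# Parquet encodings that imply a dictionary page in the column chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def parse_column_line(line):
    """Parse one column line of the dump into a dict of its fields."""
    match = COLUMN_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a column line: %r" % line)
    encodings = match.group("enc").split(",")
    info = {
        "name": match.group("name"),
        "dictionary_page_offset": int(match.group("do")),
        "first_data_page_offset": int(match.group("fpo")),
        "compressed_size": int(match.group("compressed")),
        "uncompressed_size": int(match.group("uncompressed")),
        "value_count": int(match.group("vc")),
        "encodings": encodings,
    }
    # The pattern reported here: a dictionary encoding is listed,
    # yet the dictionary page offset is 0.
    info["suspicious_dictionary_offset"] = (
        info["dictionary_page_offset"] == 0
        and bool(DICTIONARY_ENCODINGS & set(encodings))
    )
    return info


line_x = (
    "x:  DOUBLE SNAPPY DO:0 FPO:4 SZ:1632/31635/19.38 VC:70000 "
    "ENC:RLE,BIT_PACKED,PLAIN_DICTIONARY "
    "ST:[min: 1.0, max: 5.0, num_nulls: 10000]"
)
print(parse_column_line(line_x)["suspicious_dictionary_offset"])  # True
```

Both columns in the dump (x and y) match this pattern.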

{code}
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fd3effc1120>
  file_offset: 4
  file_path: 
  physical_type: DOUBLE
  num_values: 70000
  path_in_schema: x
  is_stats_set: True
  statistics:
    <pyarrow._parquet.RowGroupStatistics object at 0x7fd3effc1cb0>
      has_min_max: True
      min: 1.0
      max: 5.0
      null_count: 10000
      distinct_count: 0
      num_values: 60000
      physical_type: DOUBLE
  compression: SNAPPY
  encodings: ('PLAIN_DICTIONARY', 'RLE', 'BIT_PACKED')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 4
  total_compressed_size: 1632
  total_uncompressed_size: 31635
{code}

Is parquet-cpp still able to use the dictionary in this case?
It would be nice if parquet-cpp could recognize the parquet-mr issue and set `has_dictionary_page` to True.
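
Until something like that exists, a possible workaround on the Python side is to infer the dictionary page from the encodings list. The sketch below is only a heuristic, not how parquet-cpp behaves: it treats a column chunk as dictionary-encoded if `has_dictionary_page` is already True or if a dictionary encoding appears in `encodings`. The function name `probably_has_dictionary_page` is hypothetical, and the stand-in object only copies the values from the metadata dump above.

```python
from types import SimpleNamespace

# Parquet encodings that imply a dictionary page in the column chunk.
DICTIONARY_ENCODINGS = frozenset({"PLAIN_DICTIONARY", "RLE_DICTIONARY"})


def probably_has_dictionary_page(column_chunk):
    """Heuristic: trust has_dictionary_page, else fall back to encodings.

    `column_chunk` is any object exposing `has_dictionary_page` and
    `encodings`, the attribute names shown in the metadata dump above.
    """
    if column_chunk.has_dictionary_page:
        return True
    return any(enc in DICTIONARY_ENCODINGS for enc in column_chunk.encodings)


# Stand-in for the ColumnChunkMetaData of column x, values copied from
# the dump above (an illustration, not a real pyarrow object).
chunk_x = SimpleNamespace(
    encodings=("PLAIN_DICTIONARY", "RLE", "BIT_PACKED"),
    has_dictionary_page=False,
    dictionary_page_offset=None,
    data_page_offset=4,
)

print(probably_has_dictionary_page(chunk_x))  # True
```

With pyarrow installed, the object returned by `pyarrow.parquet.ParquetFile(path).metadata.row_group(0).column(0)` should be usable in place of the stand-in, since it exposes the same attribute names.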

https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding/


> Detect parquet-mr style dictionary_page
> ---------------------------------------
>
>                 Key: PARQUET-1547
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1547
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>            Reporter: colin fang
>            Priority: Minor


