Posted to dev@arrow.apache.org by "Hatem Helal (JIRA)" <ji...@apache.org> on 2018/10/19 12:58:00 UTC
[jira] [Created] (ARROW-3564) pyarrow: writing version 2.0 parquet format with dictionary encoding enabled

Hatem Helal created ARROW-3564:
----------------------------------

             Summary: pyarrow: writing version 2.0 parquet format with dictionary encoding enabled
                 Key: ARROW-3564
                 URL: https://issues.apache.org/jira/browse/ARROW-3564
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.11.0
            Reporter: Hatem Helal
         Attachments: example_v1.0_dict_False.parquet, example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, example_v2.0_dict_True.parquet, pyarrow_repro.py

Using pyarrow v0.11.0, the following script writes a simple table (lifted from the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both parquet format versions 1.0 and 2.0, with and without dictionary encoding enabled.
{code:python}
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import itertools

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
    'two': ['foo', 'bar', 'baz'],
    'three': [True, False, True]},
    index=list('abc'))

table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']

for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files with [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] suggests that dictionary encoding is not used in either of the version 2.0 files. Both files report that the columns are encoded using {{PLAIN,RLE}} and that the dictionary page offset is zero. I expected the column encodings to include {{RLE_DICTIONARY}}. The repro script and the files it generated are attached.

Below is the output of {{parquet-tools meta}} for the two version 2.0 files:
{panel:title=version='2.0', use_dictionary = True}
{noformat}
% parquet-tools meta example_v2.0_dict_True.parquet
file:              file:.../example_v2.0_dict_True.parquet
creator:           parquet-cpp version 1.5.1-SNAPSHOT
extra:             pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema:       schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1:       RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
{panel}
{panel:title=version='2.0', use_dictionary = False}
{noformat}
% parquet-tools meta example_v2.0_dict_False.parquet
file:              file:.../example_v2.0_dict_False.parquet
creator:           parquet-cpp version 1.5.1-SNAPSHOT
extra:             pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema:       schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1:       RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
{panel}
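
For anyone who wants to repeat this check without reading the output by eye, here is a rough sketch that scans {{parquet-tools meta}} text for column chunks with no dictionary encoding. It is not part of the attached repro script. The helper names ({{parse_column_chunks}}, {{columns_without_dictionary}}) are made up for illustration, and the regular expression assumes the column lines look exactly like the ones shown above ({{DO:}} for the dictionary page offset, {{ENC:}} for the encodings); other parquet-tools versions may print a different layout.

```python
import re

# Assumed layout of a per-column line from `parquet-tools meta`, e.g.
# "one:  DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[...]"
COLUMN_LINE = re.compile(
    r"^(?P<name>\S+):\s+\S+\s+\S+\s+DO:(?P<dict_offset>\d+)\s+.*?ENC:(?P<encodings>\S+)"
)


def parse_column_chunks(meta_output):
    """Return {column name: (dictionary page offset, set of encodings)}."""
    columns = {}
    for line in meta_output.splitlines():
        match = COLUMN_LINE.match(line.strip())
        if match:
            encodings = set(match.group("encodings").split(","))
            columns[match.group("name")] = (int(match.group("dict_offset")), encodings)
    return columns


def columns_without_dictionary(meta_output):
    """List columns whose reported encodings include no dictionary encoding."""
    return [
        name
        for name, (_, encodings) in parse_column_chunks(meta_output).items()
        if not any("DICTIONARY" in enc for enc in encodings)
    ]


# Two column lines copied from the use_dictionary = True output above.
sample = """\
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
"""
print(columns_without_dictionary(sample))  # prints ['one', 'two']
```

Run against the full output of either version 2.0 file, this should list all four columns, which matches what is described above.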


