Posted to dev@parquet.apache.org by "Ken Terada (JIRA)" <ji...@apache.org> on 2018/09/18 05:11:00 UTC
[jira] [Commented] (PARQUET-1361) [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types

    [ https://issues.apache.org/jira/browse/PARQUET-1361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618484#comment-16618484 ] 

Ken Terada commented on PARQUET-1361:
-------------------------------------

[~xhochy] and [~wesmckinn], sorry to reach out to you directly, but could you please review the last comment posted? To be clear, I am asking for your guidance, as Parquet experts, on whether this JIRA should be reclassified as a [parquet-mr|https://github.com/apache/parquet-mr] defect. Thank you for your time.

> [C++] 1.4.1 library allows creation of parquet file w/NULL values for INT types
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-1361
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1361
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cpp
>    Affects Versions: cpp-1.4.0
>            Reporter: Ken Terada
>            Priority: Major
>         Attachments: parquet-1361-repro-1.py, parquet-1361-repro-2.py, sample_w_null.csv
>
>
> The parquet-cpp v1.4.1 library allows generation of Parquet files with NULL values for INT type columns, which causes unexpected parsing errors in downstream systems that ingest those files.
> e.g.,
> {{Error parsing the parquet file: UNKNOWN can not be applied to a primitive type}}
> *+Reproduction Steps+*
> OS: CentOS 7.5.1804
> Python: 3.4.8
> +Prerequisites:+
> * Install the following packages: {{Numpy: 1.14.5}}, {{Pandas: 0.22.0}}, {{PyArrow: 0.9.0}}
> +Step 1+
> Generate the parquet file.
> {{sample_w_null.csv}}
> {code}
> col1,col2,col3,col4,col5
> 1,2,,4,5
> {code}
> {{parquet-1361-repro-1.py}}
> {code}
> #!/usr/bin/python
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> input_file = 'sample_w_null.csv'
> output_file = 'int_unknown.parquet'
> # Column dtypes passed to pandas; col3 is empty in the sample row.
> p_schema = {'col1': np.int32,
>             'col2': np.int32,
>             'col3': np.unicode_,
>             'col4': np.int32,
>             'col5': np.int32}
> df = pd.read_csv(input_file, dtype=p_schema)
> # Convert the DataFrame to an Arrow table and write it out as Parquet.
> table = pa.Table.from_pandas(df)
> pq.write_table(table, output_file)
> {code}
> +Step 2+
> Inspect the metadata of the generated file.
> {{parquet-1361-repro-2.py}}
> {code}
> #!/usr/bin/python
> import pyarrow.parquet as pq
> for filename in ['int_unknown.parquet']:
>     pq_file = pq.ParquetFile(filename)
>     print(pq_file.metadata)
>     print(pq_file.schema)
>     print(pq_file.num_row_groups)
>     print(pq.read_table(filename, columns=['col1', 'col2', 'col3', 'col4', 'col5']).to_pandas())
> {code}
> +Results+
> {code}
> <pyarrow._parquet.FileMetaData object at 0x7f53e8621100>
>   created_by: parquet-cpp version 1.4.1-SNAPSHOT
>   num_columns: 6
>   num_rows: 1
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 1434
> <pyarrow._parquet.ParquetSchema object at 0x7f53e85bd170>
> col1: INT32
> col2: INT32
> col3: INT32 UNKNOWN
> col4: INT32
> col5: INT32
> __index_level_0__: INT64
> 1
>    col1  col2  col3  col4  col5
> 0     1     2  None     4     5
> {code}
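
The schema output above shows col3, the only column with no value in sample_w_null.csv, written as INT32 UNKNOWN. As a rough pre-check before writing, the sketch below lists CSV columns whose every value is empty. It uses only the Python standard library; the helper name all_null_columns is hypothetical, and the idea that entirely empty columns are the ones at risk is an assumption drawn from this single reproduction, not a confirmed statement about parquet-cpp or PyArrow behavior.

```python
import csv
import io


def all_null_columns(csv_text):
    """Return the names of columns whose every value is empty.

    Hypothetical helper for illustration only. Note that a CSV with a
    header but no data rows reports every column as empty.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    candidates = list(reader.fieldnames or [])
    for row in reader:
        # Keep only the columns that are still empty in this row.
        candidates = [name for name in candidates
                      if not (row.get(name) or "").strip()]
    return candidates


# Same content as sample_w_null.csv from the issue description.
sample = "col1,col2,col3,col4,col5\n1,2,,4,5\n"
print(all_null_columns(sample))  # prints ['col3']
```

Columns reported this way might be dropped or given an explicit type before the file is written, depending on what the downstream reader accepts.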


