Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/13 23:13:00 UTC

[jira] [Updated] (ARROW-3210) [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas

     [ https://issues.apache.org/jira/browse/ARROW-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3210:
--------------------------------
    Summary: [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas  (was: Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas)

> [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas
> -------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-3210
>                 URL: https://issues.apache.org/jira/browse/ARROW-3210
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
>            Reporter: Ying Wang
>            Priority: Major
>         Attachments: environment.yml, repro.csv, repro.py, repro_2.py
>
>
> STEPS TO REPRODUCE:
> 1. Create a conda environment from [^environment.yml]
> 2. Run the script [^repro.py], replacing the config variables as needed, to write [^repro.csv] to S3 as a partitioned ParquetDataset
> 3. Open a reference to that ParquetDataset with the script [^repro_2.py], again replacing the config variables as needed (a rough sketch of both steps follows).
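> A rough sketch of what the two scripts do, assuming pyarrow's write_to_dataset()/ParquetDataset() and s3fs; the bucket path, filesystem setup and partition columns below are placeholders, and the real values live in the attached scripts:
> ```python
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> import s3fs
>
> fs = s3fs.S3FileSystem()
> root = 'my-bucket/_tmp/test_parquet_dataset'  # placeholder root path
>
> # repro.py step: write the CSV out to S3 as a partitioned Parquet dataset
> df = pd.read_csv('repro.csv')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path=root,
>                     partition_cols=['Draught', 'Name', 'VesselType', 'x', 'Heading'],
>                     filesystem=fs)
>
> # repro_2.py step: open the dataset again; validate_schemas() runs here and raises
> dataset = pq.ParquetDataset(root, filesystem=fs)
> ```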
>  
> EXPECTED:
> The ParquetDataset reference is created successfully.
> GOT:
> A ValueError raised from the validate_schemas() method reporting mismatched Arrow schemas:
>  
> ```python
> *** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different. 
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: string
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
>  b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>  b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
>  b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
>  b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
>  b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
>  b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
>  b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
>  b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
>  b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
>  b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
>  b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
>  b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
>  b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
>  b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
>  b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
>  b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
>  b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
>  b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "Destination", "field_name": "Destination'
>  b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
>  b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
>  b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
>  b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
>  b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
>  b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
>  b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
>  b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
>  b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
>  b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
>  b'ndas_version": "0.21.0"}'}
> vs
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: null
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
>  b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>  b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
>  b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
>  b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
>  b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
>  b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
>  b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
>  b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
>  b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
>  b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
>  b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
>  b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
>  b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
>  b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
>  b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
>  b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
>  b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
>  b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "Destination", "field_name": "Destination'
>  b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
>  b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
>  b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
>  b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
>  b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
>  b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
>  b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
>  b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
>  b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
>  b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
>  b'as_version": "0.21.0"}'}
> ```
> The issue is with the *ExtraInfo* column: *pandas_type* is *unicode* in the partitioned ParquetDatasetPiece referencing the second Parquet file created, while the ParquetDataset schema, which references the first Parquet file created, has *pandas_type* *empty* for the same column.
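> One plausible mechanism (an assumption on my part, not verified against the attached scripts): if each partition's rows are converted to Arrow separately, a partition whose ExtraInfo values are all missing is inferred by pa.Table.from_pandas() as Arrow type null (pandas_type "empty"), while a partition that contains actual strings is inferred as string (pandas_type "unicode"), so the per-file Parquet schemas disagree and validate_schemas() raises. A minimal illustration of that type inference:
> ```python
> import pandas as pd
> import pyarrow as pa
>
> # Two partitions of the same logical column: one with text, one entirely missing
> with_text = pd.DataFrame({'ExtraInfo': ['tug assist', None]})
> all_missing = pd.DataFrame({'ExtraInfo': [None, None]})
>
> print(pa.Table.from_pandas(with_text).schema)    # ExtraInfo: string (plus index field)
> print(pa.Table.from_pandas(all_missing).schema)  # ExtraInfo: null   (plus index field)
> ```
> If that is indeed the cause, possible workarounds would be to pass an explicit schema when converting each chunk (pa.Table.from_pandas(df, schema=...)) so every file is written with identical types, or, if the installed pyarrow version supports it, to open the dataset with pq.ParquetDataset(..., validate_schema=False).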



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)