Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/09/13 23:13:00 UTC
[jira] [Updated] (ARROW-3210) [Python] Creating ParquetDataset
creates partitioned ParquetFiles with mismatched Parquet schemas
[ https://issues.apache.org/jira/browse/ARROW-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney updated ARROW-3210:
--------------------------------
Summary: [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas (was: Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas)
> [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas
> -------------------------------------------------------------------------------------------------
>
> Key: ARROW-3210
> URL: https://issues.apache.org/jira/browse/ARROW-3210
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
> Reporter: Ying Wang
> Priority: Major
> Attachments: environment.yml, repro.csv, repro.py, repro_2.py
>
>
> STEPS TO REPRODUCE:
> 1. Create a conda environment reflecting [^environment.yml]
> 2. Execute script [^repro.py], replacing various config variables to create a ParquetDataset on S3 given [^repro.csv]
> 3. Create a reference to the ParquetDataset using script [^repro_2.py], again replacing the config variables as needed.
>
> EXPECTED:
> Reference is created correctly.
> GOT:
> Mismatched Arrow schemas in validate_schemas() method:
>
> ```python
> *** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: string
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
> b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
> b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
> b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
> b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
> b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
> b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
> b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
> b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
> b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
> b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
> b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
> b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
> b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
> b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
> b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
> b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
> b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "Destination", "field_name": "Destination'
> b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
> b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
> b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
> b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
> b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
> b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
> b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
> b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
> b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
> b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
> b'ndas_version": "0.21.0"}'}
> vs
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: null
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> --------
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
> b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
> b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
> b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
> b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
> b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
> b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
> b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
> b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
> b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
> b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
> b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
> b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
> b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
> b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
> b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
> b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
> b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
> b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
> b'data": null}, {"name": "Destination", "field_name": "Destination'
> b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
> b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
> b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
> b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
> b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
> b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
> b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
> b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
> b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
> b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
> b'as_version": "0.21.0"}'}
> ```
> The issue is with column *ExtraInfo*. In the ParquetDatasetPiece referencing the 2nd Parquet file created, the column has Arrow type *string* (*pandas_type* *unicode*), while in the ParquetDataset schema, which references the 1st Parquet file created, the same column has Arrow type *null* (*pandas_type* *empty*).
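
The two schema dumps in the error are long and differ in a single field, so a small helper can make the mismatch easier to spot. The sketch below is an illustration only: `parse_schema` and `diff_schemas` are hypothetical names, not part of pyarrow, and the field listings are abbreviated from the error output quoted above. It assumes each field is printed as `name: type` and that the listing ends at the `metadata` line.

```python
def parse_schema(listing):
    """Parse 'name: type' lines from a printed Arrow schema into a dict."""
    fields = {}
    for line in listing.splitlines():
        line = line.strip()
        if line == "metadata":
            break  # stop where the metadata section of the dump begins
        if ": " in line:
            name, arrow_type = line.split(": ", 1)
            fields[name] = arrow_type
    return fields


def diff_schemas(piece, dataset):
    """Return {column: (piece_type, dataset_type)} for columns that differ."""
    names = list(piece) + [n for n in dataset if n not in piece]
    return {
        n: (piece.get(n), dataset.get(n))
        for n in names
        if piece.get(n) != dataset.get(n)
    }


# Field listings abbreviated from the error output (hypothetical subset).
piece_listing = """
Status: string
ExtraInfo: string
TIMESTAMP: int64
metadata
--------
"""
dataset_listing = """
Status: string
ExtraInfo: null
TIMESTAMP: int64
metadata
--------
"""

mismatches = diff_schemas(parse_schema(piece_listing), parse_schema(dataset_listing))
print(mismatches)  # {'ExtraInfo': ('string', 'null')}
```

Run against the full listings, the same comparison should report only *ExtraInfo*, since every other field has the same type in both dumps.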
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)