Posted to issues@arrow.apache.org by "Ying Wang (JIRA)" <ji...@apache.org> on 2018/09/10 19:48:00 UTC
[jira] [Created] (ARROW-3210) Creating ParquetDataset with PyArrow
creates partitioned ParquetFiles with mismatched Parquet schemas
Ying Wang created ARROW-3210:
--------------------------------
Summary: Creating ParquetDataset with PyArrow creates partitioned ParquetFiles with mismatched Parquet schemas
Key: ARROW-3210
URL: https://issues.apache.org/jira/browse/ARROW-3210
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
Reporter: Ying Wang
Attachments: environment.yml, repro.csv, repro.py, repro_2.py
STEPS TO REPRODUCE:
1. Create a conda environment reflecting [^environment.yml]
2. Execute script [^repro.py], replacing various config variables to create a ParquetDataset on S3 given [^repro.csv]
3. Create a reference to the ParquetDataset using script [^repro_2.py], again replacing the config variables.
EXPECTED:
Reference is created correctly.
GOT:
Mismatched Arrow schemas in validate_schemas() method:
```python
*** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, Heading=1] s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet was different.
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: string
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "Destination", "field_name": "Destination'
b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
b'ndas_version": "0.21.0"}'}
vs
Record_ID: int64
y: double
TRACKID: string
MMSI: int64
IMO: int64
AgeMinutes: double
SoG: double
Width: int64
Length: int64
Callsign: string
Destination: string
ETA: int64
Status: string
ExtraInfo: null
TIMESTAMP: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
b'data": null}, {"name": "Destination", "field_name": "Destination'
b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
b'": "ExtraInfo", "pandas_type": "empty", "numpy_type": "object", '
b'"metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMESTAM'
b'P", "pandas_type": "int64", "numpy_type": "int64", "metadata": n'
b'ull}, {"name": null, "field_name": "__index_level_0__", "pandas_'
b'type": "int64", "numpy_type": "int64", "metadata": null}], "pand'
b'as_version": "0.21.0"}'}
```
The mismatch is in column *ExtraInfo*. The partitioned ParquetDatasetPiece referencing the 2nd Parquet file created reports Arrow type *string* (*pandas_type* *unicode*) for it, while the ParquetDataset schema, which references the 1st Parquet file created, reports Arrow type *null* (*pandas_type* *empty*) for that same column.
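Because the two metadata dumps are hard to compare by eye, a small helper can pick out the differing column. The following is only a sketch, not part of PyArrow: the function name diff_pandas_types and the trimmed sample metadata are hypothetical, and it assumes the b'pandas' value is JSON with a "columns" list whose entries carry "field_name" and "pandas_type", as in the output above.

```python
import json


def diff_pandas_types(meta_a, meta_b):
    """Return {column: (type_in_a, type_in_b)} for columns whose pandas_type differs.

    Hypothetical helper: meta_a and meta_b are the JSON payloads stored under
    the b'pandas' key of a schema's metadata (str or bytes both work).
    """
    def column_types(meta):
        columns = json.loads(meta)["columns"]
        return {col["field_name"]: col["pandas_type"] for col in columns}

    a, b = column_types(meta_a), column_types(meta_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Trimmed stand-ins for the two metadata values in the error message above;
# only two of the columns are kept to keep the example short.
piece_meta = json.dumps({"columns": [
    {"name": "Status", "field_name": "Status", "pandas_type": "unicode"},
    {"name": "ExtraInfo", "field_name": "ExtraInfo", "pandas_type": "unicode"},
]})
dataset_meta = json.dumps({"columns": [
    {"name": "Status", "field_name": "Status", "pandas_type": "unicode"},
    {"name": "ExtraInfo", "field_name": "ExtraInfo", "pandas_type": "empty"},
]})

print(diff_pandas_types(piece_meta, dataset_meta))
# {'ExtraInfo': ('unicode', 'empty')}
```

Run against the full metadata from the two schemas, a helper like this should report only *ExtraInfo*, matching the observation above.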
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)