Posted to jira@arrow.apache.org by "Todd Farmer (Jira)" <ji...@apache.org> on 2022/07/12 14:05:03 UTC

[jira] [Assigned] (ARROW-2860) [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read

     [ https://issues.apache.org/jira/browse/ARROW-2860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Farmer reassigned ARROW-2860:
----------------------------------

    Assignee:     (was: Joris Van den Bossche)

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [Python][Parquet][C++] Null values in a single partition of Parquet dataset, results in invalid schema on read
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-2860
>                 URL: https://issues.apache.org/jira/browse/ARROW-2860
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Sam Oluwalana
>            Priority: Major
>              Labels: dataset, dataset-parquet-read, parquet
>
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> from datetime import datetime, timedelta
>
> def generate_data(event_type, event_id, offset=0):
>     """Generate data."""
>     now = datetime.utcnow() + timedelta(seconds=offset)
>     obj = {
>         'event_type': event_type,
>         'event_id': event_id,
>         'event_date': now.date(),
>         'foo': None,
>         'bar': u'hello',
>     }
>     if event_type == 2:
>         obj['foo'] = 1
>         obj['bar'] = u'world'
>     if event_type == 3:
>         obj['different'] = u'data'
>         obj['bar'] = u'event type 3'
>     else:
>         obj['different'] = None
>     return obj
>
> data = [
>     generate_data(1, 1, 1),
>     generate_data(1, 1, 3600 * 72),
>     generate_data(2, 1, 1),
>     generate_data(2, 1, 3600 * 72),
>     generate_data(3, 1, 1),
>     generate_data(3, 1, 3600 * 72),
> ]
>
> df = pd.DataFrame.from_records(data, index='event_id')
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/tmp/events', partition_cols=['event_type', 'event_date'])
> dataset = pq.ParquetDataset('/tmp/events')
> table = dataset.read()
> print(table.num_rows)
> {code}
> Expected output:
> {code:python}
> 6
> {code}
> Actual output:
> {code:python}
> python example_failure.py
> Traceback (most recent call last):
>   File "example_failure.py", line 43, in <module>
>     dataset = pq.ParquetDataset('/tmp/events')
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 745, in __init__
>     self.validate_schemas()
>   File "/Users/sam/.virtualenvs/test-parquet/lib/python2.7/site-packages/pyarrow/parquet.py", line 775, in validate_schemas
>     dataset_schema))
> ValueError: Schema in partition[event_type=2, event_date=0] /tmp/events/event_type=3/event_date=2018-07-16 00:00:00/be001bf576674d09825539f20e99ebe5.parquet was different.
> bar: string
> different: string
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> vs
> bar: string
> different: null
> foo: double
> event_id: int64
> metadata
> --------
> {'pandas': '{"pandas_version": "0.23.3", "index_columns": ["event_id"], "columns": [{"metadata": null, "field_name": "bar", "name": "bar", "numpy_type": "object", "pandas_type": "unicode"}, {"metadata": null, "field_name": "different", "name": "different", "numpy_type": "object", "pandas_type": "empty"}, {"metadata": null, "field_name": "foo", "name": "foo", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "event_id", "name": "event_id", "numpy_type": "int64", "pandas_type": "int64"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}'}
> {code}
> Apparently pyarrow infers the schema of each partition individually. The partitions under `event_type=3/event_date=*` both contain values for the column `different`, whereas in the other partitions that column is entirely `None`. Because of that discrepancy, the all-null partitions record `different` with `pandas_type` `empty` (Arrow type `null`) instead of `unicode` (Arrow type `string`), so the per-partition schemas no longer match and `validate_schemas()` fails on read.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)