Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/03/25 11:44:00 UTC
[jira] [Updated] (ARROW-12080) [Python][Dataset] The first table schema becomes a common schema for the full Dataset
[ https://issues.apache.org/jira/browse/ARROW-12080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David Li updated ARROW-12080:
-----------------------------
Summary: [Python][Dataset] The first table schema becomes a common schema for the full Dataset (was: The first table schema becomes a common schema for the full Dataset)
> [Python][Dataset] The first table schema becomes a common schema for the full Dataset
> -------------------------------------------------------------------------------------
>
> Key: ARROW-12080
> URL: https://issues.apache.org/jira/browse/ARROW-12080
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 3.0.0
> Reporter: Borys Kabakov
> Priority: Major
> Labels: dataset, datasets
>
> The first table schema becomes a common schema for the full Dataset. This can cause problems with sparse data.
> Consider the example below: when the first chunk is full of NA values, pyarrow ignores the pandas dtypes for the whole dataset:
> {code:java}
> # get dataset
> !wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv
>
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
> import pyarrow.dataset as ds
> import shutil
> from pathlib import Path
>
>
> def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
>     if Path(output).exists():
>         shutil.rmtree(output)
>
>     # write dataset chunk by chunk
>     d_items = pd.read_csv(input_csv, index_col='row_id',
>                           usecols=['row_id', 'itemid', 'label', 'dbsource', 'category', 'param_type'],
>                           dtype={'row_id': int, 'itemid': int, 'label': str, 'dbsource': str,
>                                  'category': str, 'param_type': str},
>                           chunksize=chunksize)
>     for i, chunk in enumerate(d_items):
>         table = pa.Table.from_pandas(chunk)
>         if i == 0:
>             schema1 = pa.Schema.from_pandas(chunk)
>             schema2 = table.schema
>             # print(table.field('param_type'))
>         pq.write_to_dataset(table, root_path=output)
>
>     # read dataset
>     dataset = ds.dataset(output)
>
>     # compare schemas
>     print('Schemas are equal: ', dataset.schema == schema1 == schema2)
>     print(dataset.schema.types)
>     print('Should be string', dataset.schema.field('param_type'))
>     return dataset
> {code}
> {code:java}
> dataset = foo()
> dataset.to_table()
> >>>Schemas are equal: False
> [DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
> Should be string pyarrow.Field<param_type: null>
> ---------------------------------------------------------------------------
> ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null{code}
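> A possible workaround for the error above (just a sketch; the column list simply mirrors the reproduction and would have to be adapted to other data) is to define the schema explicitly once and pass it both when converting each chunk and when opening the dataset, so no fragment ends up with an inferred null type:
> {code:java}
> explicit_schema = pa.schema([
>     ('itemid', pa.int64()),
>     ('label', pa.string()),
>     ('dbsource', pa.string()),
>     ('category', pa.string()),
>     ('param_type', pa.string()),
>     ('row_id', pa.int64()),
> ])
>
> # force the declared types for every chunk instead of letting from_pandas infer them
> table = pa.Table.from_pandas(chunk, schema=explicit_schema)
>
> # and/or force the schema when opening the dataset, so all fragments are read with it
> dataset = ds.dataset(output, schema=explicit_schema)
> {code}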
> If you list the schemas of the individual Parquet files, you'll see that most of them ignored the pandas dtypes:
> {code:java}
> import os
>
> for i in os.listdir('tmp.parquet/'):
>     print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))
> >>>pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: string>
> pyarrow.Field<param_type: null>
> pyarrow.Field<param_type: null>
> {code}
> But if we use a bigger chunk of data that contains non-NA values, everything is OK:
> {code:java}
> dataset = foo(chunksize=10000)
> dataset.to_table()
> >>>Schemas are equal: True
> [DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
> Should be string pyarrow.Field<param_type: string>
> pyarrow.Table
> itemid: int64
> label: string
> dbsource: string
> category: string
> param_type: string
> row_id: int64
> {code}
> Checking for NA in the data:
> {code:java}
> pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
> >>>array([nan])
> pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
> >>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
> 'Checkbox'], dtype=object)
> {code}
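> The root cause appears to be type inference on that all-NA first chunk: an object column that contains only NaN is converted to the Arrow null type, no matter which dtype pandas was told to use. A minimal sketch with toy data (not taken from D_ITEMS.csv):
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
>
> # an object column holding only NaN -> Arrow infers the null type
> all_na = pd.DataFrame({'param_type': [np.nan, np.nan]}, dtype=object)
> print(pa.Table.from_pandas(all_na).schema.field('param_type'))
> # pyarrow.Field<param_type: null>
>
> # forcing the type keeps the column as (all-null) string
> forced = pa.Table.from_pandas(all_na,
>                               schema=pa.schema([('param_type', pa.string())]),
>                               preserve_index=False)
> print(forced.schema.field('param_type'))
> # pyarrow.Field<param_type: string>
> {code}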
>
> PS: switching issue reporting from GitHub to Jira is an outstanding move
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)