Posted to jira@arrow.apache.org by "Borys Kabakov (Jira)" <ji...@apache.org> on 2021/03/24 21:31:00 UTC
[jira] [Created] (ARROW-12079) The first table schema becomes a common schema for the full Dataset
Borys Kabakov created ARROW-12079:
-------------------------------------
Summary: The first table schema becomes a common schema for the full Dataset
Key: ARROW-12079
URL: https://issues.apache.org/jira/browse/ARROW-12079
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0
Environment: Ubuntu 18.04 LTS
python 3.8
Reporter: Borys Kabakov
The first table schema becomes the common schema for the full Dataset, which can cause problems with sparse data.
Consider the example below: when the first chunk is entirely NA, pyarrow ignores the dtypes from pandas for the whole dataset:
{code:python}
# get dataset
!wget https://physionet.org/files/mimiciii-demo/1.4/D_ITEMS.csv

import shutil
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq


def foo(input_csv='D_ITEMS.csv', output='tmp.parquet', chunksize=1000):
    if Path(output).exists():
        shutil.rmtree(output)

    # write dataset
    d_items = pd.read_csv(input_csv, index_col='row_id',
                          usecols=['row_id', 'itemid', 'label', 'dbsource',
                                   'category', 'param_type'],
                          dtype={'row_id': int, 'itemid': int, 'label': str,
                                 'dbsource': str, 'category': str,
                                 'param_type': str},
                          chunksize=chunksize)
    for i, chunk in enumerate(d_items):
        table = pa.Table.from_pandas(chunk)
        if i == 0:
            schema1 = pa.Schema.from_pandas(chunk)
            schema2 = table.schema
        # print(table.field('param_type'))
        pq.write_to_dataset(table, root_path=output)

    # read dataset
    dataset = ds.dataset(output)

    # compare schemas
    print('Schemas are equal: ', dataset.schema == schema1 == schema2)
    print(dataset.schema.types)
    print('Should be string', dataset.schema.field('param_type'))
    return dataset
{code}
{code:python}
dataset = foo()
dataset.to_table()
>>>Schemas are equal: False
[DataType(int64), DataType(string), DataType(string), DataType(null), DataType(null), DataType(int64)]
Should be string pyarrow.Field<param_type: null>
---------------------------------------------------------------------------
ArrowTypeError: fields had matching names but differing types. From: category: string To: category: null
{code}
If you list the schemas of the individual files, you'll see that almost all of the parquet files ignored the pandas dtypes:
{code:python}
import os

for i in os.listdir('tmp.parquet/'):
    print(ds.dataset(os.path.join('tmp.parquet/', i)).schema.field('param_type'))
>>>pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: string>
pyarrow.Field<param_type: null>
pyarrow.Field<param_type: null>
{code}
But if we read a bigger chunk of data that contains non-NA values, everything is fine:
{code:python}
dataset = foo(chunksize=10000)
dataset.to_table()
>>>Schemas are equal: True
[DataType(int64), DataType(string), DataType(string), DataType(string), DataType(string), DataType(int64)]
Should be string pyarrow.Field<param_type: string>
pyarrow.Table
itemid: int64
label: string
dbsource: string
category: string
param_type: string
row_id: int64
{code}
Checking for NA values in the data confirms that the first 1000 rows of param_type are entirely NaN:
{code:python}
pd.read_csv('D_ITEMS.csv', nrows=1000)['param_type'].unique()
>>>array([nan])
pd.read_csv('D_ITEMS.csv', nrows=10000)['param_type'].unique()
>>>array([nan, 'Numeric', 'Text', 'Date time', 'Solution', 'Process',
'Checkbox'], dtype=object)
{code}
PS: switching issue reporting from GitHub to Jira is an outstanding move
--
This message was sent by Atlassian Jira
(v8.3.4#803005)