Posted to dev@arrow.apache.org by "Tim Lantz (Jira)" <ji...@apache.org> on 2020/01/22 17:00:00 UTC

[jira] [Created] (ARROW-7655) [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema

Tim Lantz created ARROW-7655:
--------------------------------

             Summary: [Python] csv.ConvertOptions Do Not Pass Through/Retain Nullability from Schema
                 Key: ARROW-7655
                 URL: https://issues.apache.org/jira/browse/ARROW-7655
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
         Environment: Reproduced on Ubuntu 18.04 and macOS Catalina with Python 3.7.4.
            Reporter: Tim Lantz



Originally mentioned in: [https://github.com/apache/arrow/issues/6243]

*High level description of the issue:*
 * It is possible ([though not documented|https://issues.apache.org/jira/browse/ARROW-7654]) to assign a Schema object to the column_types field of ConvertOptions instead of a Dict[str, DataType].
 * Expected result: both the type and the nullable attribute of each Field in the supplied Schema carry over to the schema used when reading CSV data.
 * Actual result: the Field type information is preserved, but the nullable attribute is lost; all fields end up nullable.

*Minimal reproduction case:*
 * Use case notes: this is especially noticeable when using pyarrow as a means to save data with a known schema to Parquet, since ParquetWriter checks that the schema of each table being written matches the schema supplied to the writer. If that same schema is used to read the CSV data and contains a non-nullable field, a mismatch is detected, resulting in the error demonstrated below.

 
{code:java}
$ cat test.csv 
0
1
$ python
>>> import pyarrow
>>> from pyarrow import csv
>>> schema = pyarrow.schema([pyarrow.field(name="foo", type=pyarrow.bool_(), nullable=False)])
>>> read_options = csv.ReadOptions(column_names=["foo"])
>>> convert_options = csv.ConvertOptions(column_types=schema)
>>> table = csv.read_csv("test.csv", convert_options=convert_options, read_options=read_options)
>>> schema
foo: bool not null
>>> table.schema
foo: bool
>>> from pyarrow import parquet as pq
>>> writer = pq.ParquetWriter("test.parquet", schema)
>>> writer.write_table(table)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(REDACTED)/lib/python3.7/site-packages/pyarrow-0.15.1-py3.7-macosx-10.9-x86_64.egg/pyarrow/parquet.py", line 472, in write_table
    raise ValueError(msg)
ValueError: Table schema does not match schema used to create file: 
table:
foo: bool vs. 
file:
foo: bool not null
>>> pyarrow.__version__
'0.15.1'
>>> exit()
$ python --version
Python 3.7.4{code}
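
As a possible interim workaround (a sketch, not a verified fix: it assumes Table.cast, available in recent pyarrow versions, applies the target schema's nullability to the result), the table can be cast back to the intended schema before writing:
{code:python}
import pyarrow
from pyarrow import csv

schema = pyarrow.schema([pyarrow.field("foo", pyarrow.bool_(), nullable=False)])
read_options = csv.ReadOptions(column_names=["foo"])
convert_options = csv.ConvertOptions(column_types=schema)

table = csv.read_csv("test.csv", convert_options=convert_options,
                     read_options=read_options)
# table.schema comes back as "foo: bool" (nullable); cast to restore "not null"
fixed = table.cast(schema)
print(fixed.schema)  # foo: bool not null
{code}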
 
 * As a side note: if I don't set column_names in read_options when calling read_csv, but I do set convert_options with column_types, type inference is still performed, which seems to contradict what the docs state. That looks like a related but independent bug. I haven't yet searched to see whether it is open/known, but if someone reading this believes it should be filed with a repro case, I am happy to help. I only noticed it while minimizing the repro case, as my original code was setting column_names. A sketch of what I observed follows this note.
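
A hypothetical illustration of that side note (my guess at the mechanism: with no column_names, the first data row is consumed as the header, so the names in column_types no longer match and inference kicks in; I have not confirmed this):
{code:python}
import pyarrow
from pyarrow import csv

schema = pyarrow.schema([pyarrow.field("foo", pyarrow.bool_(), nullable=False)])
convert_options = csv.ConvertOptions(column_types=schema)

# No ReadOptions(column_names=...): the first row of test.csv ("0") is
# treated as a header, and the remaining values appear to be type-inferred
# rather than converted using the bool type from column_types.
table = csv.read_csv("test.csv", convert_options=convert_options)
print(table.schema)  # e.g. an inferred int64 column, not foo: bool
{code}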

*Potential source of issue:*
 * I have not yet looked at how hard this is to fix, but I note that [here|https://github.com/apache/arrow/blob/ace72c2afa6b7608bca9ba858fdd10b23e7f2dbf/python/pyarrow/_csv.pyx#L411] only the name and type are passed down from each Field; see the rough analogue below.
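
In other words, the Schema appears to be flattened into a name-to-type mapping, roughly like the following Python analogue (an illustration of what gets dropped, not the actual Cython code):
{code:python}
# Only name and type survive the conversion; Field.nullable is discarded.
column_types = {field.name: field.type for field in schema}
{code}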


