Posted to issues@arrow.apache.org by "Sep Dehpour (Jira)" <ji...@apache.org> on 2020/07/15 01:28:00 UTC

[jira] [Created] (ARROW-9474) Column type inference in read_csv vs. open_csv. CSV conversion error to null.

Sep Dehpour created ARROW-9474:
----------------------------------

             Summary: Column type inference in read_csv vs. open_csv. CSV conversion error to null.
                 Key: ARROW-9474
                 URL: https://issues.apache.org/jira/browse/ARROW-9474
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Sep Dehpour


The open_csv streaming reader does not adjust the inferred column type based on data seen in later blocks.

For example, if a column contains only null values in the first blocks read by open_csv, that column is inferred as null type. When PyArrow then iterates over later blocks and encounters non-null values in the column, it crashes.

Example Error:
{code:java}
pyarrow.lib.ArrowInvalid: In CSV column #44: CSV conversion error to null: invalid value '-176400' {code}
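
A minimal sketch that should reproduce the behaviour (the file contents, column names and block size below are illustrative, not taken from the actual data):
{code:python}
import io
import pyarrow.csv as csv

# Column "b" is null for many rows before its first real value appears.
rows = ["a,b"] + ["%d," % i for i in range(10000)] + ["10000,-176400"]
data = io.BytesIO("\n".join(rows).encode())

# A small block_size makes the initial inference pass see only nulls in "b",
# so "b" is typed as null. The later non-null value then fails to convert.
reader = csv.open_csv(data, read_options=csv.ReadOptions(block_size=1024))

for batch in reader:  # raises pyarrow.lib.ArrowInvalid partway through
    print(batch.num_rows)
{code}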

This problem goes away if ReadOptions with a huge block_size is passed to open_csv, but that negates the whole point of streaming with open_csv instead of reading everything at once with read_csv.
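
For reference, the workaround looks roughly like this (the file name and block size are made up for illustration):
{code:python}
import pyarrow.csv as csv

# A block_size larger than the whole file forces type inference to see every
# value, but effectively turns the streaming reader back into a one-shot read.
read_options = csv.ReadOptions(block_size=1 << 30)
reader = csv.open_csv("data.csv", read_options=read_options)
table = reader.read_all()
{code}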


System info:

PyArrow 0.17.1, macOS Catalina, Python 3.7.4


