Posted to issues@arrow.apache.org by "Athanassios Hatzis (Jira)" <ji...@apache.org> on 2020/02/19 07:36:00 UTC

[jira] [Comment Edited] (ARROW-7628) [Python] read_csv problematic cases

    [ https://issues.apache.org/jira/browse/ARROW-7628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039767#comment-17039767 ] 

Athanassios Hatzis edited comment on ARROW-7628 at 2/19/20 7:35 AM:
--------------------------------------------------------------------

Thanks [~apitrou] for clearing up these cases. Yes, I agree, it is a matter of semantics.

Point 2: perhaps it would be better to default to <strings_can_be_null=True> whenever the user specifies the <null_values> parameter.

Point 1: I confused the <include_columns> and <column_names> options. However, in my example above, if I specify <column_names=['catcost', 'catqnt', 'catdate', 'catchk', 'catname']> and <skip_rows=10>, I still get an error:

Traceback (most recent call last):
  File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-1b43672bf70b>", line 6, in <module>
    use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 7

So I guess the corner case for you is: what is the right combination of parameters to read a subset of columns from a CSV file while also skipping the first N lines of the file?


was (Author: athanassios):
Thanks [~apitrou] for clearing up these cases. Yes, I agree, it is a matter of semantics.

Point 2: perhaps it would be better to default to <strings_can_be_null=True> whenever the user specifies the <null_values> parameter.

Point 1: I confused the <include_columns> and <column_names> options. However, in my example above, if I specify <column_names=['catcost', 'catqnt', 'catdate', 'catchk', 'catname']>, I still get an error:

Traceback (most recent call last):
  File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-15-1b43672bf70b>", line 6, in <module>
    use_threads=True, column_names=column_names)
  File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 5 columns, got 7

So I guess the corner case for you is: what is the right combination of parameters to read a subset of columns from a CSV file while also skipping the first N lines of the file?

> [Python] read_csv problematic cases
> -----------------------------------
>
>                 Key: ARROW-7628
>                 URL: https://issues.apache.org/jira/browse/ARROW-7628
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>         Environment: Ubuntu bionic
>            Reporter: Athanassios Hatzis
>            Priority: Minor
>              Labels: csv, pyarrow
>         Attachments: spc_catalog.tsv
>
>
> Hi, I have found two problematic cases, possibly bugs, in the pyarrow *read_csv* function. I have written the following piece of code and run a test on the attached CSV file.
> The code compares pandas read_csv with pyarrow csv to show that the latter does not behave correctly with the following sets of parameters:
> 1. Change parameter skip_rows = 10:
> {code:python}
> Traceback (most recent call last):
>   File "/home/athan/anaconda3/envs/TRIADB/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3326, in run_code
>     exec(code_obj, self.user_global_ns, self.user_ns)
>   File "<ipython-input-21-8c5c88b190c4>", line 4, in <module>
>     read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
>   File "pyarrow/_csv.pyx", line 541, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowKeyError: Column 'catcost' in include_columns does not exist in CSV file
> {code}
> 2. Change parameters skip_rows = 12, columns = None:
> In this case you don't get the error above and all columns are fetched. However, compare the two dataframes: the one from pyarrow via to_pandas() and the one returned by pandas read_csv(). The pyarrow one has not parsed the null values ('\\N') in the last column, catname, whereas pandas read_csv parsed all the null values correctly.
> {code:python}
> Out[28]: 
>    1082  991   16.5    200 2014-09-10  1  bar
> 0  1082  997   0.55  100.0 2014-09-10  1  bar
> 1  1082  998   7.95  200.0 2014-03-03  0   \N
> 2  1083  998  12.50    NaN        NaT  0  bar
> 3  1083  999   1.00    NaN        NaT  0  foo
> 4  1084  994  57.30  100.0 2014-12-20  1   \N
> 5  1084  995  22.20    NaN        NaT  0  foo
> 6  1084  998  48.60  200.0 2014-12-20  1  foo
> {code}
> Python code to test the attached CSV file for the bugs reported above:
> {code:python}
> from pyarrow import csv
> import pyarrow as pa
> import pandas as pd
> file_location = 'spc_catalog.tsv'
> sep = '\t'
> nulls=['\\N']
> columns = ['catcost', 'catqnt', 'catdate', 'catchk', 'catname']
> column_names = None
> column_types = None
> skip_rows = None
> nrecords = None
> csv.read_csv(file_location,
>     parse_options=csv.ParseOptions(delimiter=sep),
>     convert_options=csv.ConvertOptions(include_columns=columns, column_types=column_types, null_values=nulls),
>     read_options=csv.ReadOptions(skip_rows=skip_rows, autogenerate_column_names=False, use_threads=True, column_names=column_names)
> ).to_pandas()
> pd.read_csv(file_location, sep=sep, na_values='\\N', usecols=columns, nrows=nrecords, names=column_names, dtype=column_types)
> {code}


