Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/03/17 14:13:00 UTC
[jira] [Updated] (ARROW-12001) [C++][CSV] Allow missing columns at end of row
[ https://issues.apache.org/jira/browse/ARROW-12001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-12001:
-----------------------------------
Summary: [C++][CSV] Allow missing columns at end of row (was: pyarrow.lib.ArrowInvalid: CSV parse error: Expected 4 columns, got 6)
> [C++][CSV] Allow missing columns at end of row
> ----------------------------------------------
>
> Key: ARROW-12001
> URL: https://issues.apache.org/jira/browse/ARROW-12001
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Nithin Kumara Narayanaswamy Teekaramanaa
> Priority: Major
> Attachments: test.csv
>
>
> Test scenario:
> I read the same attached CSV file with pandas and with pyarrow to compare the results.
> 1. pandas reads the file into a DataFrame without problems, and the result is as follows:
> {code:java}
> import pandas as pd
> df = pd.read_csv('test.csv', names=['col1', 'col2', 'col3', 'col4', 'col5','col6'])
> >>> df
>        col1   col2    col3  col4  col5  col6
> 0  20210317  julie   23434  test  data   1.0
> 1  20210316   adam  232423  test   NaN   NaN
> {code}
> 2. With pyarrow csv, I get a parse error:
> {code:java}
> from pyarrow import csv
> import pyarrow as pa
> read_options = csv.ReadOptions(column_names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
> convert_options = csv.ConvertOptions(column_types=pa.schema(fields))
> table = csv.read_csv('test.csv', read_options=read_options, convert_options=convert_options)
> ERROR:
> Traceback (most recent call last):
> File ".../test_pyarr.py", line 71, in <module>
> table = csv.read_csv('test.csv',
> File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
> File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
> File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 4
> {code}
> Is there a parameter that can be set to fill in null values when trailing column values are missing from a row for the specified schema?
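
Until the reader supports short rows directly, one possible workaround is to pad them before parsing. The following is a minimal sketch, not part of the original report: it uses only the Python standard library, the helper name pad_short_rows is hypothetical, and the sample text is an assumed reconstruction of test.csv based on the pandas output above (the attachment itself is not shown here).

```python
import csv
import io


def pad_short_rows(text, n_columns, fill=""):
    """Return CSV text in which every non-empty row has exactly n_columns fields.

    Hypothetical helper for illustration: rows with fewer fields are padded
    on the right with `fill`; rows with more fields raise ValueError.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        if not row:
            # Skip blank lines rather than turning them into all-empty rows.
            continue
        if len(row) > n_columns:
            raise ValueError(
                f"expected at most {n_columns} columns, got {len(row)}"
            )
        writer.writerow(row + [fill] * (n_columns - len(row)))
    return out.getvalue()


# Assumed sample mirroring the pandas output in the report.
sample = (
    "20210317,julie,23434,test,data,1.0\n"
    "20210316,adam,232423,test\n"
)
padded = pad_short_rows(sample, 6)
print(padded)
```

The padded text could then be encoded, wrapped in io.BytesIO, and passed to csv.read_csv with the same read_options as in the report. Whether the padded empty fields come back as nulls depends on the column types and on the null-handling settings in ConvertOptions, so that part should be checked against the pyarrow version in use.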