Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/04/29 18:59:00 UTC

[jira] [Updated] (ARROW-12001) [C++][CSV] Allow missing columns at end of row

     [ https://issues.apache.org/jira/browse/ARROW-12001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-12001:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++][CSV] Allow missing columns at end of row
> ----------------------------------------------
>
>                 Key: ARROW-12001
>                 URL: https://issues.apache.org/jira/browse/ARROW-12001
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nithin Kumara Narayanaswamy Teekaramanaa
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: test.csv
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Test scenario :
> I read the same attached CSV file with pandas and with pyarrow to compare the results.
>  1. pandas reads it into a DataFrame without problems, and the result is as follows:
> {code:java}
> import pandas as pd
> df = pd.read_csv('test.csv', names=['col1', 'col2', 'col3', 'col4', 'col5','col6'])
> >>> df
>        col1   col2    col3  col4  col5  col6
> 0  20210317  julie   23434  test  data   1.0
> 1  20210316   adam  232423  test   NaN   NaN{code}
>  2. With pyarrow.csv, I get a parse error:
> {code:java}
> from pyarrow import csv
> import pyarrow as pa
> read_options = csv.ReadOptions(column_names=['col1', 'col2', 'col3', 'col4', 'col5', 'col6'])
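> # `fields` (the pyarrow fields for the six columns) is defined earlier in the script and not shown here
> convert_options = csv.ConvertOptions(column_types=pa.schema(fields))
> table = csv.read_csv('test.csv', read_options=read_options,
>                      convert_options=convert_options)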
> ERROR:
> Traceback (most recent call last):
>   File ".../test_pyarr.py", line 71, in <module>
>     table = csv.read_csv('test.csv',
>   File "pyarrow/_csv.pyx", line 714, in pyarrow._csv.read_csv
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parse error: Expected 6 columns, got 4
> {code}
> Is there a parameter that fills the missing trailing columns with nulls when a row has fewer values than the specified schema?
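
Until the reader supports this directly, one possible workaround is to pad the short rows before handing the data to pyarrow. The sketch below uses only the Python standard library. The helper name `pad_short_rows` is hypothetical, and the sample rows are reconstructed from the pandas output quoted above, not taken from the attached test.csv, so treat both as assumptions.

```python
import csv
import io


def pad_short_rows(text, n_columns, fill=""):
    """Return CSV text in which every row has exactly n_columns fields.

    Hypothetical helper: rows with fewer fields are padded at the end with
    `fill`; rows with more fields raise ValueError instead of being truncated.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    reader = csv.reader(io.StringIO(text, newline=""))
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip completely blank lines
        if len(row) > n_columns:
            raise ValueError(
                f"row {line_no}: expected at most {n_columns} columns, got {len(row)}"
            )
        # Append one `fill` value per missing trailing column.
        writer.writerow(row + [fill] * (n_columns - len(row)))
    return out.getvalue()


# Sample rows reconstructed from the pandas output in the report (assumption).
sample = (
    "20210317,julie,23434,test,data,1\n"
    "20210316,adam,232423,test\n"
)
padded = pad_short_rows(sample, 6)
print(padded)
```

The padded text could then be passed to the reader as a buffer, for example `pyarrow.csv.read_csv(pa.BufferReader(padded.encode()), ...)` with the same options as in the report. As far as I know, pyarrow converts empty fields to null by default only for non-string columns, so string columns would presumably also need `strings_can_be_null=True` in `ConvertOptions`. Note that the script in the report binds the name `csv` to `pyarrow.csv`, so the standard-library module would need an alias there.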


