Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2019/09/02 15:02:00 UTC

[jira] [Updated] (ARROW-6231) [C++][Python] Consider assigning default column names when reading CSV file and header_rows=0

     [ https://issues.apache.org/jira/browse/ARROW-6231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson updated ARROW-6231:
-----------------------------------
    Component/s: C++

> [C++][Python] Consider assigning default column names when reading CSV file and header_rows=0
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-6231
>                 URL: https://issues.apache.org/jira/browse/ARROW-6231
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>              Labels: csv, pull-request-available
>             Fix For: 0.15.0
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
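> This is a minor usability rough edge: reading a CSV file with header_rows=0 currently fails unless column names are passed explicitly. Assigning default names (such as "f0", "f1", ...) would probably be better, since you could then at least see how many columns there are and what they contain.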
> {code}
> In [10]: parse_options = csv.ParseOptions(delimiter='|', header_rows=0)
> In [11]: %time table = csv.read_csv('Performance_2016Q4.txt', parse_options=parse_options)
> ---------------------------------------------------------------------------
> ArrowInvalid                              Traceback (most recent call last)
> <timed exec> in <module>
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/_csv.pyx in pyarrow._csv.read_csv()
> ~/miniconda/envs/pyarrow-14-1/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowInvalid: header_rows == 0 needs explicit column names
> {code}
> pandas uses integer column labels in this case, so Arrow would need to define some kind of default string names instead:
> {code}
> In [18]: df = pd.read_csv('Performance_2016Q4.txt', sep='|', header=None, low_memory=False)
> In [19]: df.columns
> Out[19]:
> Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
>             17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
>            dtype='int64')
> {code}
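
The naming scheme proposed above can be sketched with Python's standard csv module alone. This is only an illustration of generating "f0", "f1", ... from the width of the first row; it is not pyarrow's implementation, and read_headerless, its prefix parameter, and the sample data are hypothetical names invented for this example.

```python
import csv
import io


def read_headerless(text, delimiter="|", prefix="f"):
    """Parse delimiter-separated text that has no header row.

    Hypothetical helper: returns (names, rows), where names are
    generated as prefix + column index ("f0", "f1", ...).
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        # Nothing to read, so there are no columns to name.
        return [], []
    # Assumption: the first row's width decides how many names to generate.
    names = [f"{prefix}{i}" for i in range(len(rows[0]))]
    return names, rows


# Made-up sample data, pipe-delimited like the file in the report.
sample = "100|2016-10|3.5\n101|2016-11|4.0\n"
names, rows = read_headerless(sample)
print(names)  # ['f0', 'f1', 'f2']
```

Unlike the integer labels pandas assigns, string names of this kind can be used directly as field names in a schema, and they still show how many columns were read.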
