Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/05/10 12:22:01 UTC

[jira] [Comment Edited] (ARROW-12661) [C++] CSV add skip rows after column names

    [ https://issues.apache.org/jira/browse/ARROW-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17341871#comment-17341871 ] 

Joris Van den Bossche edited comment on ARROW-12661 at 5/10/21, 12:21 PM:
--------------------------------------------------------------------------

Some general comments about potential ways to provide this functionality (using the pandas.read_csv API as reference):

- Pandas skips empty lines by default (Arrow does the same, but is stricter: a line containing only a single whitespace character already counts as non-empty). Pandas also has a {{comment}} keyword that takes a character; a row starting with that character is skipped. Both might already cover some of the use cases where one wants to skip rows after the header.
- In pandas.read_csv, the {{skiprows}} keyword takes either an integer (the number of rows to skip from the start of the file) or a list of integers (the exact 0-based indices of the rows to skip). This way, a single, flexible keyword can skip rows after the header, and can also skip rows both before and after the header, e.g. with {{[0, 2]}} if the header is on the second row.
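
To make the semantics described above concrete, here is a minimal standard-library sketch of a reader that accepts either an integer or a list of 0-based row indices. The function {{read_rows}} and its parameters are hypothetical names for illustration only; this is neither the pandas implementation nor an existing Arrow API. It assumes one CSV row per physical line (no newlines inside quoted fields) and treats only completely empty lines as empty.

```python
import csv
import io


def read_rows(text, skiprows=0, comment=None):
    """Hypothetical helper, not a pandas or Arrow API.

    skiprows: an int (skip that many rows from the start) or an
              iterable of 0-based row indices to skip.
    comment:  optional character; rows starting with it are skipped.
    Returns (header, data), where header is the first remaining row.
    """
    if isinstance(skiprows, int):
        skipped = set(range(skiprows))
    else:
        skipped = set(skiprows)

    kept = []
    for index, line in enumerate(io.StringIO(text)):
        if index in skipped:
            continue
        # Assumption: only a completely empty line counts as empty,
        # so a line holding a single space is kept.
        if not line.strip("\r\n"):
            continue
        if comment is not None and line.startswith(comment):
            continue
        kept.append(line)

    rows = list(csv.reader(kept))
    if not rows:
        return [], []
    return rows[0], rows[1:]


# Header on the second row, with a descriptive row right after it.
text = "generated by tool\nname,value\nstr,int\na,1\nb,2\n"

# Skip one row before the header (index 0) and one after it (index 2).
header, data = read_rows(text, skiprows=[0, 2])
```

With the list form, the row before the header and the descriptive row after it are both dropped in one call. With {{skiprows=1}}, only the first row is dropped and the descriptive row would be returned as data.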


was (Author: jorisvandenbossche):
Some general comments about potential ways to provide this functionality (using the pandas.read_csv API as reference):

- Pandas skips empty lines by default (Arrow does the same, I think). Pandas also has a {{comment}} keyword that takes a character; a row starting with that character is skipped. Both might already cover some of the use cases where one wants to skip rows after the header.
- In pandas.read_csv, the {{skiprows}} keyword takes either an integer (the number of rows to skip from the start of the file) or a list of integers (the exact 0-based indices of the rows to skip). This way, a single, flexible keyword can skip rows after the header, and can also skip rows both before and after the header, e.g. with {{[0, 2]}} if the header is on the second row.

> [C++] CSV add skip rows after column names
> ------------------------------------------
>
>                 Key: ARROW-12661
>                 URL: https://issues.apache.org/jira/browse/ARROW-12661
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Nate Clark
>            Priority: Major
>              Labels: csv
>
> Some programs generate CSV files with additional descriptive information about the columns on a row after the column names. For files like this, it would be nice to have an option that reads the first row as column names and then skips the descriptive rows that follow.
> This could probably be implemented easily as either another option parallel to ReadOptions::skip_rows or a boolean that indicates whether skipping should occur before or after the column names are read.


