Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2018/08/14 12:46:00 UTC

[jira] [Updated] (ARROW-25) [C++] Implement delimited file scanner / CSV reader

     [ https://issues.apache.org/jira/browse/ARROW-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-25:
------------------------------
    Summary: [C++] Implement delimited file scanner / CSV reader  (was: C++: Implement delimited file scanner / CSV reader)

> [C++] Implement delimited file scanner / CSV reader
> ---------------------------------------------------
>
>                 Key: ARROW-25
>                 URL: https://issues.apache.org/jira/browse/ARROW-25
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>
> Like Parquet and binary file formats, text files will be an important data medium for converting to and from in-memory Arrow data. 
> pandas has some (Apache-compatible) business logic we can learn from here (as one of the gold-standard CSV readers in production use)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
> https://github.com/pydata/pandas/blob/master/pandas/parser.pyx
> While the pandas parser is very fast, the Arrow reader should be largely written from scratch to target the Arrow memory layout, but we can reuse certain aspects, such as the tokenizer DFA (which originally came from the Python interpreter's csv module implementation)
> https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
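
To illustrate what a tokenizer DFA of this kind does, here is a minimal C++ sketch. It is not the pandas tokenizer and not an Arrow API: the names State, Row, and TokenizeCsv are hypothetical, and the sketch assumes a single-character delimiter, double-quote quoting with "" as the escaped quote, and it simply drops carriage returns outside quoted fields. It collects fields into strings rather than targeting the Arrow memory layout.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration only; not pandas or Arrow code.
enum class State {
  FieldStart,     // at the beginning of a field
  Unquoted,       // inside an unquoted field
  Quoted,         // inside a quoted field
  QuoteInQuoted   // just saw a quote while inside a quoted field
};

using Row = std::vector<std::string>;

// Split CSV text into rows of fields with a four-state machine.
std::vector<Row> TokenizeCsv(const std::string& input, char delim = ',',
                             char quote = '"') {
  std::vector<Row> rows;
  Row row;
  std::string field;
  State state = State::FieldStart;

  auto end_field = [&] {
    row.push_back(field);
    field.clear();
    state = State::FieldStart;
  };
  auto end_row = [&] {
    end_field();
    rows.push_back(row);
    row.clear();
  };
  // Shared handling for a character seen outside quotes.
  auto plain = [&](char ch) {
    if (ch == delim) {
      end_field();
    } else if (ch == '\n') {
      end_row();
    } else if (ch != '\r') {  // assumption: drop CR outside quotes
      field.push_back(ch);
      state = State::Unquoted;
    }
  };

  for (char c : input) {
    switch (state) {
      case State::FieldStart:
        if (c == quote) state = State::Quoted;
        else plain(c);
        break;
      case State::Unquoted:
        plain(c);
        break;
      case State::Quoted:
        if (c == quote) state = State::QuoteInQuoted;
        else field.push_back(c);  // delimiters and newlines are literal here
        break;
      case State::QuoteInQuoted:
        if (c == quote) {  // doubled quote: emit one literal quote
          field.push_back(quote);
          state = State::Quoted;
        } else {
          plain(c);  // the quote closed the field
        }
        break;
    }
  }
  // Flush a final row that is not terminated by a newline.
  if (state != State::FieldStart || !field.empty() || !row.empty()) end_row();
  return rows;
}
```

A reader targeting Arrow would presumably record field offsets into a contiguous buffer instead of building one std::string per field; the state transitions are the part that carries over.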



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)