Posted to issues@arrow.apache.org by "Antoine Pitrou (JIRA)" <ji...@apache.org> on 2019/07/18 08:36:00 UTC

[jira] [Commented] (ARROW-5974) read_csv returns truncated read for some valid gzip files

    [ https://issues.apache.org/jira/browse/ARROW-5974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887750#comment-16887750 ] 

Antoine Pitrou commented on ARROW-5974:
---------------------------------------

I'm not sure Pandas does it deliberately. [~wesmckinn] Do you know about that?

> read_csv returns truncated read for some valid gzip files
> ---------------------------------------------------------
>
>                 Key: ARROW-5974
>                 URL: https://issues.apache.org/jira/browse/ARROW-5974
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0, 0.14.0
>            Reporter: Jordan Samuels
>            Priority: Minor
>
> If two gzipped files are concatenated together, the result is a valid gzip file.  However, it appears that pyarrow.csv.read_csv will only read the portion related to the first file.
> If the repro script [here|https://gist.github.com/jordansamuels/d69f1c22c58418f5dfa0785b9ecd211e] is run, the output is:
> {{$ python repro.py}}
> {{pyarrow.csv only reads one row:}}
> {{ x}}
> {{0 1}}
> {{pandas reads two rows:}}
> {{ x}}
> {{0 1}}
> {{1 2}}
> {{pyarrow version: 0.14.0}}
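
For illustration only, the multi-member behaviour described in the report can be reproduced with Python's standard library alone. This is a minimal sketch, not pyarrow's or pandas' actual code path: the CSV contents and the helper name `decompress_all_members` are made up for the example, and whether pyarrow's truncation arises the same way is an assumption, not something established in this thread.

```python
import gzip
import zlib

# Two independently gzipped CSV fragments (contents are illustrative).
first = gzip.compress(b"x\n1\n")
second = gzip.compress(b"2\n")

# Concatenating gzip members yields a valid multi-member gzip stream.
concatenated = first + second

# gzip.decompress handles multi-member streams and returns all the data.
full = gzip.decompress(concatenated)

# A single zlib decompression object (gzip framing enabled via "| 16")
# stops at the end of the first member; the rest lands in unused_data.
decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)
first_only = decoder.decompress(concatenated)


def decompress_all_members(data: bytes) -> bytes:
    """Hypothetical helper: restart a decoder for every gzip member."""
    chunks = []
    while data:
        member = zlib.decompressobj(zlib.MAX_WBITS | 16)
        chunks.append(member.decompress(data))
        chunks.append(member.flush())
        data = member.unused_data  # bytes belonging to the next member
    return b"".join(chunks)


print(full)        # both rows
print(first_only)  # only the first member's rows
```

A decoder that stops after the first member and ignores `unused_data` would yield the one-row result shown in the report, whereas restarting the decoder on the leftover bytes recovers both rows.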


