
Posted to jira@arrow.apache.org by "David Li (Jira)" <ji...@apache.org> on 2021/12/10 17:10:00 UTC

[jira] [Commented] (ARROW-15060) open_dataset() on csv files lacks support for compressed files

    [ https://issues.apache.org/jira/browse/ARROW-15060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457278#comment-17457278 ] 

David Li commented on ARROW-15060:
----------------------------------

Hmm, ARROW-10372 was supposed to implement this. Do you get an error when you try to open a dataset of compressed CSV files?

> open_dataset() on csv files lacks support for compressed files
> --------------------------------------------------------------
>
>                 Key: ARROW-15060
>                 URL: https://issues.apache.org/jira/browse/ARROW-15060
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Carl Boettiger
>            Priority: Major
>
> Using open_dataset() on S3 buckets of csv files is game-changing magic, particularly with all the additional support for database / dplyr operations over the remote connection, and the widespread adoption of S3 buckets even by old-school big data providers like NOAA.
>
> It's not uncommon to encounter buckets of *.csv.gz files. I know this should technically be unnecessary, since the server can compress "in flight", but it is usually not an issue for R users because R's `connection` class automatically detects and gunzips compressed files (over either POSIX or HTTP connections). It would be really great if arrow could handle this case too.
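
For illustration only, the kind of transparent handling described in the report can be sketched in Python with the standard library. This is not Arrow or R code, and it makes no claim about how either implements the feature. It assumes detection by the two-byte gzip magic number, and `open_maybe_gzip` and `read_csv_rows` are hypothetical helper names.

```python
import csv
import gzip

# Assumption: gzip content is recognised by its leading magic bytes (RFC 1952),
# not by the file extension.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path):
    """Open `path` as UTF-8 text, decompressing transparently if it is gzip."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, mode="rt", newline="", encoding="utf-8")
    return open(path, mode="r", newline="", encoding="utf-8")


def read_csv_rows(path):
    """Return the rows of a plain or gzip-compressed CSV file as dicts."""
    with open_maybe_gzip(path) as handle:
        return list(csv.DictReader(handle))
```

With helpers like these, the caller reads `data.csv` and `data.csv.gz` through the same function and never has to state whether the file is compressed, which is the convenience the report asks for.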



--
This message was sent by Atlassian Jira
(v8.20.1#820001)