
Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/19 16:36:21 UTC

[GitHub] [arrow] Oduig opened a new issue, #33790: Support for reading .csv files from a zip archive

Oduig opened a new issue, #33790:
URL: https://github.com/apache/arrow/issues/33790

   ### Describe the enhancement requested
   
   I would like to read CSVs from *.zip archives. The supported compression formats include gzip and bz2, but not zip.
   Would it be possible to add this as an extension?
   
   Supporting zip archives would allow Airbyte to use pyarrow to read CSVs from compressed ZIP archives.
   
   I looked around to see whether this had been proposed before but couldn't find anything, and from browsing the sources I find it difficult to judge how easy or hard it would be to contribute a fix.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] Oduig commented on issue #33790: [Python] Support for reading .csv files from a zip archive

Posted by "Oduig (via GitHub)" <gi...@apache.org>.

Oduig commented on issue #33790:
URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399198244

   Thank you for the reply. I see the relevant code is in the cpp section! It already works with gz and bz2, but not with (Windows-esque) .zip files. Is there a reason why zip is not included as a compression codec?



[GitHub] [arrow] westonpace commented on issue #33790: [Python] Support for reading .csv files from a zip archive

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #33790:
URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399044202

   Outside of datasets this is normally achieved by opening a compressed input stream and using the CSV stream reader.  If the path ends in `.gz` or `.bz2` I think we also guess that it is compressed and do this for you.
   
   Within datasets there are a few undocumented or under-documented features that may help.  There is a similar "extension guessing" mechanism: https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_base.cc#L93  So if your file names end in `gz` or `gzip`, the compression should be picked up automatically.
   
   There is also `stream_transform_func`, part of the dataset-csv options, which takes an arbitrary callable that transforms the stream before reading starts.  In theory, this could be used to support zipped files.
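
Outside of datasets, one possible workaround is to let Python's standard `zipfile` module do the decompression and hand the resulting stream to the CSV reader. The sketch below is only an illustration, not something taken from the Arrow code base: the helper names `open_csv_in_zip` and `read_csv_from_zip` are hypothetical, and it relies on `pyarrow.csv.read_csv` accepting a file-like object. It does not use `stream_transform_func`.

```python
import contextlib
import zipfile


@contextlib.contextmanager
def open_csv_in_zip(zip_source, member=None):
    """Yield a binary stream for one CSV member of a zip archive.

    `zip_source` is a path or a seekable binary file object.
    `member` is the name inside the archive; if omitted, the archive
    must contain exactly one entry whose name ends in ".csv".
    (Hypothetical helper, not part of pyarrow.)
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            csv_names = [
                name for name in archive.namelist()
                if name.lower().endswith(".csv")
            ]
            if len(csv_names) != 1:
                raise ValueError(
                    f"expected exactly one .csv member, found {csv_names}"
                )
            member = csv_names[0]
        # ZipFile.open returns a readable, decompressing file-like object.
        with archive.open(member) as stream:
            yield stream


def read_csv_from_zip(zip_source, member=None):
    """Read one CSV member of a zip archive into a pyarrow Table.

    Assumes pyarrow is installed; it is imported lazily so that the
    helper above stays usable without it.
    """
    import pyarrow.csv as pacsv

    with open_csv_in_zip(zip_source, member) as stream:
        return pacsv.read_csv(stream)
```

With this approach the decompression runs through Python's `zipfile` rather than through an Arrow codec, so it may be slower than the built-in gzip or bz2 paths, and it reads a single member per call.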



[GitHub] [arrow] westonpace commented on issue #33790: [Python] Support for reading .csv files from a zip archive

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #33790:
URL: https://github.com/apache/arrow/issues/33790#issuecomment-1399253361

   No reason that I'm aware of.  It's most likely just that no one has wanted it enough to take the time to add a Windows zip codec.

