Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/01 13:07:44 UTC

[GitHub] [arrow] zeapo commented on pull request #7309: ARROW-8993: [Rust] support reading gzipped json files

zeapo commented on pull request #7309:
URL: https://github.com/apache/arrow/pull/7309#issuecomment-636850680


   Thanks for your feedback.
   
   > Do you mean other compression formats?
   
   Yes, but not only. It would also allow other filesystems (like S3), where using `File` is not possible and taking a `dyn Read` makes much more sense (in rusoto they use `ByteStream`, which implements `Read` and `AsyncRead`).
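   
   To illustrate the idea, here is a minimal sketch of a function that is generic over `Read` instead of taking a `File`. The name `count_json_lines` is hypothetical and not part of arrow::json; it only shows that the same code path can accept a file, an in-memory buffer, or a decompressing wrapper such as flate2's `GzDecoder`, which also implements `Read`.
   
   ```rust
   use std::io::{self, BufRead, BufReader, Read};
   
   /// Hypothetical helper (not an arrow API): counts the non-empty lines of
   /// line-delimited JSON coming from any byte source, not just a `File`.
   fn count_json_lines<R: Read>(source: R) -> io::Result<usize> {
       let reader = BufReader::new(source);
       let mut count = 0;
       for line in reader.lines() {
           // Propagate I/O errors, skip blank lines.
           if !line?.trim().is_empty() {
               count += 1;
           }
       }
       Ok(count)
   }
   
   fn main() -> io::Result<()> {
       // An in-memory byte slice stands in for a remote object or a
       // decompressed stream; `&[u8]` implements `Read`.
       let data = b"{\"a\": 1}\n{\"a\": 2}\n\n{\"a\": 3}\n";
       let n = count_json_lines(&data[..])?;
       println!("{} records", n);
       Ok(())
   }
   ```
   
   A `File` can be passed to the same function unchanged, since it implements `Read` as well.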
   
   > There have also been some changes on the arrow::csv side, such as allowing inference of multiple files, which might also be convenient to have in arrow::json
   
   That would be great. It would help when many files are written without a fixed batch size (a mix of small and big files), so the data ends up scattered across multiple files. If you have a JIRA issue for this I can take a look at it :)
   
   > I'm still pro returning the reader back to the start, or is there a performance impact in doing so? I wouldn't want to place the burden of seeking on the user, because I'd expect the common inference case to be getting the schema then reading the file.
   
   I agree that placing the burden on the user is a bad idea. However, there are situations where we just can't seek back to the start (S3 is one example). Maybe there could be a specific implementation for `Seek + Read` that seeks back to the start, and one for `Read` only that does not. However... this would need specialization, so more nightly dependencies.
   
   Not really sure :/
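   
   For illustration only, one way to sketch the two behaviours on stable Rust, without specialization, is to expose them as two separately named functions rather than one overloaded entry point. All names below (`peek_first_line`, `infer_and_rewind`, `infer_consuming`) are hypothetical, and reading the first line merely stands in for real schema inference.
   
   ```rust
   use std::io::{self, BufRead, BufReader, Cursor, Read, Seek, SeekFrom};
   
   /// Hypothetical stand-in for schema inference: reads the first line.
   fn peek_first_line<R: Read>(reader: &mut BufReader<R>) -> io::Result<String> {
       let mut line = String::new();
       reader.read_line(&mut line)?;
       Ok(line.trim_end().to_string())
   }
   
   /// Seekable sources: inspect the data, then rewind so the caller can
   /// read the whole input again from the start.
   fn infer_and_rewind<R: Read + Seek>(reader: &mut BufReader<R>) -> io::Result<String> {
       let first = peek_first_line(reader)?;
       reader.seek(SeekFrom::Start(0))?;
       Ok(first)
   }
   
   /// Non-seekable sources: the inspected bytes stay consumed, so the
   /// caller continues after the inspected portion.
   fn infer_consuming<R: Read>(reader: &mut BufReader<R>) -> io::Result<String> {
       peek_first_line(reader)
   }
   
   fn main() -> io::Result<()> {
       let data = "{\"a\": 1}\n{\"a\": 2}\n";
   
       // `Cursor` is seekable: after inference the first line is read again.
       let mut seekable = BufReader::new(Cursor::new(data));
       let first = infer_and_rewind(&mut seekable)?;
       let mut next = String::new();
       seekable.read_line(&mut next)?;
       assert_eq!(first, next.trim_end());
   
       // A plain byte slice implements `Read` but not `Seek`: the next
       // read continues with the second line.
       let mut stream = BufReader::new(data.as_bytes());
       let first = infer_consuming(&mut stream)?;
       let mut next = String::new();
       stream.read_line(&mut next)?;
       assert_ne!(first, next.trim_end());
       Ok(())
   }
   ```
   
   The trade-off is that the caller has to pick the right function explicitly, which is exactly the burden that specialization would remove.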

