You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/02 13:57:49 UTC

[GitHub] [arrow] westonpace commented on issue #33972: [Python] Remove redundant S3 call

westonpace commented on issue #33972:
URL: https://github.com/apache/arrow/issues/33972#issuecomment-1413788554

   The datasets feature went through considerable change a while back when it moved from a parquet-only feature to format-agnostic.  Looks like this connection came loose in the conversion.  If you just want to read one file the approach is normally something more like:
   
   ```
   import pyarrow.parquet as pq
   pq.read_table(path)
   ```
   
   If you're looking to read a collection of files you would normally use:
   
   ```
   import pyarrow.dataset as ds
   ds.dataset([paths]).to_table()
   ```
   
   I suspect (though am not entirely certain) both of the above paths will only read the metadata once.
   
   However, your usage is legitimate, and it even affects the normal datasets path when you scan the dataset multiple times (because we should be caching the metadata on the first scan and reusing on the second).  So I would consider this a bug.
   
   I don't know for sure but my guess is the problem is [here](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/file_parquet.cc#L364).  The fragment is opening a reader and should pass the metadata to the reader, if already populated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org