Posted to issues@arrow.apache.org by "lucazanna (via GitHub)" <gi...@apache.org> on 2023/03/17 15:03:30 UTC

[GitHub] [arrow] lucazanna opened a new issue, #34608: [Python] No limit pushdown for scanning Parquet on Azure

lucazanna opened a new issue, #34608:
URL: https://github.com/apache/arrow/issues/34608

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I am accessing a Parquet file on Azure data lake with the following code. 
   (to make the example reproducible, the example has a file publicly accessible on Azure)
   
   ```
   import pyarrow.dataset as ds
   from adlfs import AzureBlobFileSystem
   
   abfs_public = AzureBlobFileSystem(
       account_name="azureopendatastorage")  # public storage account, so no credentials are needed
   
   dataset_public = ds.dataset(
       'az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet',
       filesystem=abfs_public)
   ```
   
   The processing time is roughly the same whether I access the full file or only the first 5 rows:
   
   ```
   dataset_public.head(5)
   # 5min 11s
   
   dataset_public.to_table()
   # 5min 30s
   
   dataset_public.scanner().head(5)
   # 5min 43s
   ```
   I would expect reading only 5 rows to take much less time. I am also not sure what the difference is between `.scanner().head()` and `.head()`.
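   
   As a point of comparison (just a sketch of a possible workaround, not something I have benchmarked on this file, and it only helps if the file has more than one row group), reading a single row group through `pyarrow.parquet.ParquetFile` should fetch only the footer plus that row group's byte range instead of the whole blob:
   
   ```
   import pyarrow.parquet as pq
   from adlfs import AzureBlobFileSystem
   
   abfs_public = AzureBlobFileSystem(
       account_name="azureopendatastorage")
   
   path = ('az://nyctlc/yellow/puYear=2010/puMonth=1/'
           'part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet')
   
   # Open the blob as a file-like object, read only the first row group,
   # and slice the first 5 rows from it.
   with abfs_public.open(path) as f:
       pf = pq.ParquetFile(f)
       first_rows = pf.read_row_group(0).slice(0, 5)
   ```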
   
   Regarding column selection: reducing the number of columns retrieved speeds up the query, but the speed-up seems smaller than expected. For example, selecting only 2 of the 21 columns reduces the query time to 2 min 7 s:
   
   ```
   dataset_public.scanner(columns=['vendorID','passengerCount']).to_table()
   # 2min 7s
   ```
   
   I would have expected that collecting roughly 10% of the columns (2 columns instead of 21) would reduce the time by much more than half. Or is there a fixed overhead per query when collecting from Azure?
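   
   To sanity-check that expectation (again just a sketch, not measured here), the per-column compressed sizes can be read from the Parquet footer; the two selected columns may account for more or less than 10% of the bytes:
   
   ```
   import pyarrow.parquet as pq
   from collections import defaultdict
   from adlfs import AzureBlobFileSystem
   
   abfs_public = AzureBlobFileSystem(
       account_name="azureopendatastorage")
   path = ('az://nyctlc/yellow/puYear=2010/puMonth=1/'
           'part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet')
   
   with abfs_public.open(path) as f:
       meta = pq.ParquetFile(f).metadata
   
   # Sum the compressed size of every column across all row groups.
   sizes = defaultdict(int)
   for rg in range(meta.num_row_groups):
       row_group = meta.row_group(rg)
       for col in range(row_group.num_columns):
           chunk = row_group.column(col)
           sizes[chunk.path_in_schema] += chunk.total_compressed_size
   
   for name, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
       print(f"{name}: {size / 1e6:.1f} MB compressed")
   ```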
   
   Sources for the code:
   * Apache Arrow website: https://arrow.apache.org/docs/python/parquet.html#reading-parquet-and-memory-mapping
   * ADLFS Github page: https://github.com/fsspec/adlfs
   
   Thank you for the outstanding work on the Arrow library.
   
   ### Component(s)
   
   Python



Re: [I] [Python] No limit pushdown for scanning Parquet on Azure [arrow]

Posted by "Tom-Newton (via GitHub)" <gi...@apache.org>.
Tom-Newton commented on issue #34608:
URL: https://github.com/apache/arrow/issues/34608#issuecomment-1937748358

   I can't remember especially clearly, but I think I have had some success using limits on Azure before. One thing to bear in mind is that I think Arrow will only read entire row groups, so depending on the row group layout of your Parquet file, that might explain things.
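   
   If it helps, the row group layout is cheap to check from the footer (a quick sketch, reusing the public filesystem from the original report):
   
   ```
   import pyarrow.parquet as pq
   from adlfs import AzureBlobFileSystem
   
   abfs = AzureBlobFileSystem(account_name="azureopendatastorage")
   path = ('az://nyctlc/yellow/puYear=2010/puMonth=1/'
           'part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet')
   
   with abfs.open(path) as f:
       meta = pq.ParquetFile(f).metadata
   
   # If there is only one (or one huge) row group, head(5) cannot avoid
   # reading most of the file.
   print("row groups:", meta.num_row_groups)
   print("rows in first row group:", meta.row_group(0).num_rows)
   ```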
   
   Additionally, I have found that `adlfs` does some prefetching which I've personally found to be counter-productive. When I tested this I was using a native Arrow filesystem implementation, which is very close to being available officially in `pyarrow`. The native filesystem should also be generally faster and more reliable.
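   
   For reference, once that lands I would expect usage to look roughly like this (a rough sketch only: it assumes a pyarrow build that exposes `pyarrow.fs.AzureFileSystem` and that authentication to this public account is handled; names and parameters may differ in the released version):
   
   ```
   import pyarrow.dataset as ds
   from pyarrow import fs
   
   # Assumption: the native Azure filesystem is available in this pyarrow build.
   azure = fs.AzureFileSystem(account_name="azureopendatastorage")
   
   # Native filesystems take container/blob paths, without the az:// prefix.
   dataset_native = ds.dataset(
       'nyctlc/yellow/puYear=2010/puMonth=1/'
       'part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet',
       filesystem=azure)
   ```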

