Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 14:30:51 UTC

[GitHub] [arrow] maartenbreddels commented on pull request #7756: ARROW-9458: [Python] Release GIL in ScanTask.execute

maartenbreddels commented on pull request #7756:
URL: https://github.com/apache/arrow/pull/7756#issuecomment-658214530


   FYI, this:
   ```python
   ## common code ##
   import concurrent.futures
   import glob

   import pyarrow as pa
   import pyarrow.dataset as ds

   pool = concurrent.futures.ThreadPoolExecutor()
   # Note: this rebinds `ds` from the module alias to the Dataset object.
   ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
   ## end common code ##
   
   def process(f):
       # Materialize the whole fragment as a table and count its rows.
       return len(f.to_table(use_threads=False))
   sum(pool.map(process, ds.get_fragments()))
   ```
   For me this takes between 10 and 16 seconds (very irregular).
   
   While this:
   ```python
   def process(fragment):
       scanned = 0
       for scan_task in fragment.scan(use_threads=False):
           for record_batch in scan_task.execute():
               scanned += record_batch.num_rows
       return scanned
   sum(pool.map(process, ds.get_fragments()))
   ```
   takes 7-9 seconds, more consistently.
   
   Each file (fragment) has 1 million rows (1 row group). These could be uninteresting details, but I thought I'd share them.
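
The timings above are quoted as rough ranges. A minimal sketch of how the spread could be measured more systematically follows, assuming each variant is wrapped in a function that takes one item and returns a row count. The helper names `time_runs` and `summarize` are hypothetical and not part of pyarrow; only the standard library is used, and a trivial stand-in workload replaces the Parquet fragments.

```python
import concurrent.futures
import statistics
import time


def time_runs(work, items, repeats=5, max_workers=None):
    """Map `work` over `items` on a thread pool `repeats` times.

    Returns (total, durations): the summed result of the last run and
    the wall-clock seconds of every run. Hypothetical helper.
    """
    items = list(items)  # allow generators to be replayed on every repeat
    durations = []
    total = None
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(repeats):
            start = time.perf_counter()
            total = sum(pool.map(work, items))
            durations.append(time.perf_counter() - start)
    return total, durations


def summarize(durations):
    """Reduce a list of durations to min, median, and max (seconds)."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


# Stand-in workload: counting the elements of small lists instead of fragments.
total, durations = time_runs(len, [[1, 2], [3], [4, 5, 6]], repeats=3)
stats = summarize(durations)
print(total, stats)
```

With the snippets above, this could be called as `time_runs(process, ds.get_fragments())` once for each definition of `process`, and the min/max gap compared between the two.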

