Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 14:30:51 UTC
[GitHub] [arrow] maartenbreddels commented on pull request #7756: ARROW-9458: [Python] Release GIL in ScanTask.execute
maartenbreddels commented on pull request #7756:
URL: https://github.com/apache/arrow/pull/7756#issuecomment-658214530
FYI, this:
```python
## common code ##
import concurrent.futures
import glob

import pyarrow as pa
import pyarrow.dataset  # makes pa.dataset available without shadowing `ds` below

pool = concurrent.futures.ThreadPoolExecutor()
ds = pa.dataset.dataset(glob.glob('/data/taxi_parquet/data_*.parquet'))
## end common code ##

def process(f):
    # Materialize the whole fragment as a table and return its row count
    return len(f.to_table(use_threads=False))

sum(pool.map(process, ds.get_fragments()))
```
For me, this takes between 10 and 16 seconds (very irregular).
While this:
```python
def process(fragment):
    scanned = 0
    for scan_task in fragment.scan(use_threads=False):
        for record_batch in scan_task.execute():
            scanned += record_batch.num_rows
    return scanned

sum(pool.map(process, ds.get_fragments()))
```
takes 7-9 seconds, more consistently.
Each file (fragment) has 1 million rows (1 row group). These could be uninteresting details, but I thought I'd share them.
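Since the timings above vary a lot between runs, a small harness that repeats each variant and reports the spread may make the comparison easier to reproduce. The following is only a sketch using the standard library; `time_runs`, `summarize`, and `fake_process` are hypothetical names that are not part of the code above, and the stand-in workload would need to be swapped for the real `sum(pool.map(process, ds.get_fragments()))` call.

```python
import concurrent.futures
import statistics
import time


def time_runs(run, repeats=5):
    """Call run() `repeats` times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, max) of the measured durations, in seconds."""
    return min(durations), statistics.median(durations), max(durations)


def fake_process(n):
    # Hypothetical stand-in for process(fragment); it only burns a little CPU.
    return sum(range(n))


if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Replace the lambda body with the real workload to time it, e.g.
        # lambda: sum(pool.map(process, ds.get_fragments()))
        durations = time_runs(lambda: sum(pool.map(fake_process, [10_000] * 8)))
    low, mid, high = summarize(durations)
    print(f"min={low:.4f}s median={mid:.4f}s max={high:.4f}s")
```

Reporting the minimum, median, and maximum instead of a single number keeps the irregularity visible, which is the interesting part of the comparison here.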