You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/11 02:22:49 UTC
[GitHub] [iceberg] puchengy opened a new pull request, #4745: [python legacy] use process pool instead of thread pool for files planning
puchengy opened a new pull request, #4745:
URL: https://github.com/apache/iceberg/pull/4745
Based on analysis, `plan_files()` spent most of the time on copying and file reading which are all IO bounded operations. However, current implementation is using thread pool instead of process pool which does not speed up the operation at all.
This diff proposes switch to process pool instead of thread pool.
Frame graph
![profile](https://user-images.githubusercontent.com/8072956/167751644-bb69eef7-051e-4514-afcb-7e163519cb48.svg)
Command
`py-spy record -o profile.svg -- python myprogram.py`
Code
```
from iceberg.hive import HiveTables
conf = {"hive.metastore.uris": 'xxxx', 'iceberg.scan.plan-in-worker-pool': True, 'iceberg.worker.num-threads': 4}
tables = HiveTables(conf)
table = tables.load('abc.xyz')
scan = table.new_scan()
files = scan.plan_files()
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] samredai commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
samredai commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124296011
Looking at the function passed to the pool ([get_scans_for_manifest](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/data_table_scan.py#L79)) I can understand how this improves the parsing and evaluator logic. For the avro file reads, I'm not sure but I'd expect scaling to be easier with multithreading since that's I/O bound. @Fokko is it possible that threading may be better here beyond a certain number of files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] puchengy commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
puchengy commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130293044
@rdblue Thanks I will have a follow up PR for that.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] samredai commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
samredai commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1126363439
@chyzzqo2 this is a great point. We already have the `iceberg.scan.plan-in-worker-pool` config property that enables this code path. Maybe we can add a config along the lines of `iceberg.scan.plan-in-worker-pool.type` with values of "threads" or "processes". Also I just noticed that "threads" is in the config property name for pool sizes ([here](https://github.com/apache/iceberg/blob/93111e83b2e56989a980d410b431f3edaa1c21c2/python_legacy/iceberg/core/util/__init__.py#L30-L32)) which can be confusing if processes are actually being used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on a diff in pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#discussion_r870996580
##########
python_legacy/iceberg/core/data_table_scan.py:
##########
@@ -18,7 +18,7 @@
import itertools
import logging
from multiprocessing import cpu_count
-from multiprocessing.dummy import Pool
Review Comment:
nit; we could combine the imports:
```python
from multiprocessing import cpu_count, Pool
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124596346
@samredai
> Looking at the function passed to the pool ([get_scans_for_manifest](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/data_table_scan.py#L79)) I can understand how this improves the parsing and evaluator logic. For the avro file reads, I'm not sure but I'd expect this would scale better with multithreading since that's I/O bound. @Fokko is it possible that threading may be better here beyond a certain number of files?
I think it would make sense to parallelize that as well to reduce the runtime of the function. Especially if we need to fetch files from external storage that introduces some latency, I would parallelize that kind of calls as well to improve overall throughput.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130275271
I think we want to follow up and implement @samredai's suggestion to add a pool type that controls this. We can do that in a follow-up, though. In the meantime, I'll merge this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130276269
Thanks, @puchengy!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue merged pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
rdblue merged PR #4745:
URL: https://github.com/apache/iceberg/pull/4745
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124377865
Sorry, I didn't mean to close this. That was an accident.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] rdblue closed pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
rdblue closed pull request #4745: [python legacy] use process pool instead of thread pool for files planning
URL: https://github.com/apache/iceberg/pull/4745
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] chyzzqo2 commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning
Posted by GitBox <gi...@apache.org>.
chyzzqo2 commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1125480624
In general in python its hard to deal with libraries that make this choice for you. Parallelism via threads and processes don't mix so the choice is better left to the application rather than the library. Can we make the choice configurable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org