You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/11 02:22:49 UTC

[GitHub] [iceberg] puchengy opened a new pull request, #4745: [python legacy] use process pool instead of thread pool for files planning

puchengy opened a new pull request, #4745:
URL: https://github.com/apache/iceberg/pull/4745

   Based on analysis, `plan_files()` spent most of the time on copying and file reading which are all IO bounded operations. However, current implementation is using thread pool instead of process pool which does not speed up the operation at all.
   
   This diff proposes switch to process pool instead of thread pool.
   
   Frame graph
   ![profile](https://user-images.githubusercontent.com/8072956/167751644-bb69eef7-051e-4514-afcb-7e163519cb48.svg)
   
   Command
   `py-spy record -o profile.svg -- python myprogram.py`
   
   Code
   ```
   from iceberg.hive import HiveTables
   
   conf = {"hive.metastore.uris": 'xxxx', 'iceberg.scan.plan-in-worker-pool': True, 'iceberg.worker.num-threads': 4}
   tables = HiveTables(conf)
   table = tables.load('abc.xyz')
   scan = table.new_scan()
   files = scan.plan_files()
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] samredai commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

samredai commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124296011

   Looking at the function passed to the pool ([get_scans_for_manifest](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/data_table_scan.py#L79)) I can understand how this improves the parsing and evaluator logic. For the avro file reads, I'm not sure but I'd expect scaling to be easier with multithreading since that's I/O bound. @Fokko is it possible that threading may be better here beyond a certain number of files?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] puchengy commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

puchengy commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130293044

   @rdblue Thanks I will have a follow up PR for that.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] samredai commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

samredai commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1126363439

   @chyzzqo2 this is a great point. We already have the `iceberg.scan.plan-in-worker-pool` config property that enables this code path. Maybe we can add a config along the lines of `iceberg.scan.plan-in-worker-pool.type` with values of "threads" or "processes". Also I just noticed that "threads" is in the config property name for pool sizes ([here](https://github.com/apache/iceberg/blob/93111e83b2e56989a980d410b431f3edaa1c21c2/python_legacy/iceberg/core/util/__init__.py#L30-L32)) which can be confusing if processes are actually being used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on a diff in pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

Fokko commented on code in PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#discussion_r870996580


##########
python_legacy/iceberg/core/data_table_scan.py:
##########
@@ -18,7 +18,7 @@
 import itertools
 import logging
 from multiprocessing import cpu_count
-from multiprocessing.dummy import Pool

Review Comment:
   nit; we could combine the imports:
   ```python
   from multiprocessing import cpu_count, Pool
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] Fokko commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

Fokko commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124596346

   @samredai 
   
   > Looking at the function passed to the pool ([get_scans_for_manifest](https://github.com/apache/iceberg/blob/master/python_legacy/iceberg/core/data_table_scan.py#L79)) I can understand how this improves the parsing and evaluator logic. For the avro file reads, I'm not sure but I'd expect this would scale better with multithreading since that's I/O bound. @Fokko is it possible that threading may be better here beyond a certain number of files?
   
   I think it would make sense to parallelize that as well to reduce the runtime of the function. Especially if we need to fetch files from external storage that introduces some latency, I would parallelize that kind of calls as well to improve overall throughput.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130275271

   I think we want to follow up and implement @samredai's suggestion to add a pool type that controls this. We can do that in a follow-up, though. In the meantime, I'll merge this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1130276269

   Thanks, @puchengy!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue merged pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

rdblue merged PR #4745:
URL: https://github.com/apache/iceberg/pull/4745


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

rdblue commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1124377865

   Sorry, I didn't mean to close this. That was an accident.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue closed pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

rdblue closed pull request #4745: [python legacy] use process pool instead of thread pool for files planning
URL: https://github.com/apache/iceberg/pull/4745


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] chyzzqo2 commented on pull request #4745: [python legacy] use process pool instead of thread pool for files planning

Posted by GitBox <gi...@apache.org>.

chyzzqo2 commented on PR #4745:
URL: https://github.com/apache/iceberg/pull/4745#issuecomment-1125480624

   In general in python its hard to deal with libraries that make this choice for you. Parallelism via threads and processes don't mix so the choice is better left to the application rather than the library. Can we make the choice configurable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org