Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/28 03:21:35 UTC

[GitHub] [arrow] westonpace opened a new pull request #10420: [C++] Create work stealing implementation of generalized ThreadPool

westonpace opened a new pull request #10420:
URL: https://github.com/apache/arrow/pull/10420


   Following the spirit of the literature, I've done my best to keep things lock free.  Most management tasks (resizing, launching new workers, shutting down, etc.) still require locking.  In addition, a thread that runs out of work must take a lock so that it can safely go to sleep.
   
   Finally, we will probably need locking when requests come in from outside the thread pool (at the moment we naively assume these requests come from a single external thread).
   
   However, the hot path of working through tasks submitted by other tasks should need only a minimal amount of synchronization (although I'm not entirely convinced this benefit is worth the complexity at the moment; see the upcoming discussion in ARROW-10117).  Very early benchmarking shows a ~2x speedup on this hot path.
   
   Of course, the more interesting case is the one where work stealing saves us from losing cache locality.  I haven't gotten around to benchmarking that yet.
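
For readers unfamiliar with the technique, here is a minimal sketch of the queue discipline behind work stealing. It is not the code in this pull request: every name (`WorkerQueue`, `NextTask`, `RunDemo`) is hypothetical, and each queue is guarded by a mutex for simplicity, whereas the PR aims to keep the owner's hot path lock free. The owner pops the newest task from its own queue (better cache locality), while an idle worker steals the oldest task from the opposite end.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the API from the pull request.
using Task = std::function<void()>;

// One queue per worker. A mutex guards it for simplicity; a lock-free
// design would replace it on the owner's push/pop path.
class WorkerQueue {
 public:
  // Owner side: a running task pushes its child tasks onto its own queue.
  void Push(Task task) {
    std::lock_guard<std::mutex> lock(mutex_);
    tasks_.push_back(std::move(task));
  }

  // Owner side: take the most recently pushed task (LIFO), whose data is
  // most likely to still be in this core's cache.
  bool Pop(Task* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    *out = std::move(tasks_.back());
    tasks_.pop_back();
    return true;
  }

  // Thief side: take the oldest task (FIFO) from the opposite end.
  bool Steal(Task* out) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (tasks_.empty()) return false;
    *out = std::move(tasks_.front());
    tasks_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::deque<Task> tasks_;
};

// Find the next task for worker `self`: its own queue first, then steal.
bool NextTask(std::vector<WorkerQueue>& queues, std::size_t self, Task* out) {
  if (queues[self].Pop(out)) return true;
  for (std::size_t i = 1; i < queues.size(); ++i) {
    std::size_t victim = (self + i) % queues.size();
    if (queues[victim].Steal(out)) return true;
  }
  return false;  // A real pool would take a lock and go to sleep here.
}

// Single-threaded walk-through: returns the order in which tasks ran.
std::vector<int> RunDemo() {
  std::vector<WorkerQueue> queues(2);
  std::vector<int> order;
  for (int id = 1; id <= 3; ++id) {
    queues[0].Push([&order, id] { order.push_back(id); });
  }
  Task task;
  // Worker 1 has no work of its own, so it steals the oldest task (1).
  if (NextTask(queues, 1, &task)) task();
  // Worker 0 drains its own queue newest-first (3, then 2).
  while (NextTask(queues, 0, &task)) task();
  return order;
}
```

Calling `RunDemo()` yields the order 1, 3, 2: the thief takes the oldest task and the owner works through the rest newest-first.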
   
   Keeping in draft at the moment:
   
   * In addition to the work above that still needs to be done, the current implementation does not support incoming requests from multiple outside threads.
   * Also, resize is not currently supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] westonpace commented on pull request #10420: [C++] Create work stealing implementation of generalized ThreadPool

Posted by GitBox <gi...@apache.org>.
westonpace commented on pull request #10420:
URL: https://github.com/apache/arrow/pull/10420#issuecomment-881178575


   Closing for now to keep my PR queue clean.  I'll reopen after the 5.0.0 release.





[GitHub] [arrow] wesm edited a comment on pull request #10420: [C++] Create work stealing implementation of generalized ThreadPool

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #10420:
URL: https://github.com/apache/arrow/pull/10420#issuecomment-857717570


   I'm not able to wade deeply into the implementation details, but in general I'm excited about deploying work stealing more broadly in the codebase as a means of managing nested parallelism. This is particularly relevant not only for query execution but also for controlling resource fairness between concurrently executing file readers (CSV, Parquet, etc.) or IPC message decompression (which can also be parallelized), and anywhere else we have introduced parallelism.
   
   That said, there isn't necessarily a *requirement* that everything be made work-stealing. Rather, any workload (like reading a file) should have access to a non-global task queue to put its child tasks into. Whether the task queue happens to be global or not (versus a thread-local queue where idle threads will steal work) is up to the application developer. I imagine reaching this goal will require some refactoring from where we are now, but it seems to me, from first principles, like the right way to go. Let me know if this is consistent with your thinking (since I'm thinking out loud a little bit myself, and you've spent more time thinking about this and working on the parallel/asynchronous computing machinery around the codebase in recent times).
   
   Might be good to discuss this on the mailing list, too, for increased visibility. 
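
One way to picture the decoupling described in the comment above is a workload that only sees an abstract queue interface, so the scheduling policy can be swapped underneath it. The sketch below is an illustration under stated assumptions, not Arrow's API: `TaskQueue`, `FifoTaskQueue`, `LifoTaskQueue`, `SubmitChunks`, and `RunWorkload` are all hypothetical names, the two queues are single-threaded stand-ins for a shared queue and a worker-local queue, and neither is thread-safe.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical interface: the only thing a workload sees.
class TaskQueue {
 public:
  virtual ~TaskQueue() = default;
  virtual void Submit(Task task) = 0;
  virtual bool RunOne() = 0;  // Runs one queued task; false if none remain.
};

// Stand-in for a shared queue: tasks run in submission order.
class FifoTaskQueue : public TaskQueue {
 public:
  void Submit(Task task) override { tasks_.push_back(std::move(task)); }
  bool RunOne() override {
    if (tasks_.empty()) return false;
    Task task = std::move(tasks_.front());
    tasks_.pop_front();
    task();
    return true;
  }

 private:
  std::deque<Task> tasks_;
};

// Stand-in for a worker-local queue: the newest task runs first.
class LifoTaskQueue : public TaskQueue {
 public:
  void Submit(Task task) override { tasks_.push_back(std::move(task)); }
  bool RunOne() override {
    if (tasks_.empty()) return false;
    Task task = std::move(tasks_.back());
    tasks_.pop_back();
    task();
    return true;
  }

 private:
  std::deque<Task> tasks_;
};

// A workload (think "read a file in chunks") that knows only the interface.
void SubmitChunks(TaskQueue& queue, int num_chunks, std::vector<int>* done) {
  for (int chunk = 0; chunk < num_chunks; ++chunk) {
    queue.Submit([done, chunk] { done->push_back(chunk); });
  }
}

// Submits three chunks and drains the queue; returns the completion order.
std::vector<int> RunWorkload(TaskQueue& queue) {
  std::vector<int> done;
  SubmitChunks(queue, 3, &done);
  while (queue.RunOne()) {
  }
  return done;
}
```

The same `SubmitChunks` code runs unchanged against either queue; only the completion order differs (0, 1, 2 for the FIFO stand-in and 2, 1, 0 for the LIFO one), which is the sense in which the policy is left to whoever constructs the queue.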








[GitHub] [arrow] github-actions[bot] commented on pull request #10420: [C++] Create work stealing implementation of generalized ThreadPool

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10420:
URL: https://github.com/apache/arrow/pull/10420#issuecomment-850082465


   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes), could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] westonpace closed pull request #10420: [C++] Create work stealing implementation of generalized ThreadPool

Posted by GitBox <gi...@apache.org>.
westonpace closed pull request #10420:
URL: https://github.com/apache/arrow/pull/10420


   

