Posted to issues@arrow.apache.org by "joemarshall (via GitHub)" <gi...@apache.org> on 2023/04/17 08:36:15 UTC

[GitHub] [arrow] joemarshall opened a new issue, #35176: Arrow_threading

joemarshall opened a new issue, #35176:
URL: https://github.com/apache/arrow/issues/35176

   ### Describe the enhancement requested
   
   I've built most of arrow (pyarrow and dependencies) for emscripten. It would be good to have a way to disable threading, as a lot of emscripten use is in browsers where threading may not be available.
   
   At the moment I just put in dummy pthreads, which means some functionality (e.g. in datasets) fails because it assumes threading is available.
   
   
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #35176: [C++] Arrow_threading

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1516694631

   > One thought, but if ThreadPool was to wrap a singleton SerialExecutor if ARROW_DISABLE_THREADING was set, things might just work well enough for a first attempt at an unthreaded build? I don't know whether there'd be any deadlocks in i/o vs compute tasks though?
   
   If everything is tasks and a single `SerialExecutor` is used then it should be ok (e.g. there shouldn't be deadlocks).  It will just be very slow when doing I/O because we will be sitting there waiting on I/O with our one thread while the CPU sits completely idle.
   
   That being said, I'm sure there are a few bugs / things that will need to be converted.
   
   However, I'm not sure we can just wrap the global thread pools with a serial executor.  The challenge with the serial executor is that it has to co-opt the calling thread.  This means we create the executor when the call starts.
   
   ```
   void ReadTable() {
     // DoReadTable is a function returning a Future.
     // Note that we are not calling it here, but passing it as a parameter.
     RunInSerialExecutor(DoReadTable);
   }
   ```
   
   However, if we are combining I/O and CPU into a single pool, then it should be possible to create a special executor that lazily creates a serial executor when a task is first submitted.




[GitHub] [arrow] westonpace commented on issue #35176: [C++] Arrow_threading

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1516697187

   Something like...
   
   ```
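   void AddTask(Task t) {
     if (instance_) {
       // A serial executor is already running; hand the task to it.
       instance_->AddTask(t);
     } else {
       // First task: create the serial executor for it, then discard the
       // executor once it has finished.
       instance_ = SerialExecutor(t);
       instance_ = nullptr;
     }
   }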
   ```
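
   To make the idea above concrete, here is a minimal, self-contained sketch of an executor that starts running on the calling thread when the first task is submitted. It is not Arrow's `SerialExecutor`; `LazySerialExecutor`, `running_`, and `RunLazyDemo` are hypothetical names used only for illustration, and tasks are assumed not to throw.

   ```cpp
   #include <deque>
   #include <functional>
   #include <utility>
   #include <vector>

   // Hypothetical stand-in for a lazily started serial executor (not Arrow's API).
   class LazySerialExecutor {
    public:
     using Task = std::function<void()>;

     // If no run loop is active, the caller's thread is co-opted and tasks are
     // run until the queue is empty. If a loop is already active (we are being
     // called from inside a task), the task is only queued and the outer loop
     // picks it up later.
     void AddTask(Task t) {
       queue_.push_back(std::move(t));
       if (running_) return;
       running_ = true;
       while (!queue_.empty()) {
         Task next = std::move(queue_.front());
         queue_.pop_front();
         next();
       }
       running_ = false;
     }

    private:
     std::deque<Task> queue_;
     bool running_ = false;
   };

   // Records the order in which the outer task and a nested task execute.
   std::vector<int> RunLazyDemo() {
     LazySerialExecutor exec;
     std::vector<int> order;
     exec.AddTask([&] {
       order.push_back(1);
       exec.AddTask([&] { order.push_back(3); });  // nested submit: queued only
       order.push_back(2);
     });
     return order;
   }
   ```

   The nested task runs only after the outer task returns, so nothing ever blocks waiting for a second thread. A real implementation would also need to decide what happens when a task throws or when a future never completes.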




[GitHub] [arrow] joemarshall commented on issue #35176: [C++] Arrow_threading

Posted by "joemarshall (via GitHub)" <gi...@apache.org>.
joemarshall commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1514833849

   One thought, but if ThreadPool was to wrap a singleton SerialExecutor if ARROW_DISABLE_THREADING was set, things might just work well enough for a first attempt at an unthreaded build? I don't know whether there'd be any deadlocks in i/o vs compute tasks though? 




[GitHub] [arrow] joemarshall commented on issue #35176: [C++] Arrow_threading

Posted by "joemarshall (via GitHub)" <gi...@apache.org>.
joemarshall commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1529773588

   I did some work on this - it's currently functional for quite a lot of things but failing some tests. I'm working through them.
   
   It has to keep the concept of multiple executors because loads of the other logic relies on that, but all active tasks from any executors are dispatched in turn whenever anything waits for any task or future. 
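
   A rough sketch of that dispatch scheme, under the assumption that each executor is just a task queue and that waiting on a future drains the queues in turn. All names here (`Executor`, `IntFuture`, `WaitFor`, `RunTwoExecutorDemo`) are hypothetical and are not the names used in the actual change.

   ```cpp
   #include <deque>
   #include <functional>
   #include <utility>
   #include <vector>

   using Task = std::function<void()>;

   // One queue per logical executor (e.g. "cpu" and "io"), so existing code can
   // still target a specific executor even though there is only one thread.
   struct Executor {
     std::deque<Task> tasks;
     void Spawn(Task t) { tasks.push_back(std::move(t)); }
   };

   // A minimal future: a completion flag plus a value set by some task.
   struct IntFuture {
     bool done = false;
     int value = 0;
   };

   // Waiting runs one task from each executor in turn until the future is done.
   // Returns false if every queue is empty and the future is still unfinished.
   bool WaitFor(IntFuture& fut, std::vector<Executor*>& executors) {
     while (!fut.done) {
       bool ran_any = false;
       for (Executor* ex : executors) {
         if (ex->tasks.empty()) continue;
         Task t = std::move(ex->tasks.front());
         ex->tasks.pop_front();
         t();
         ran_any = true;
         if (fut.done) break;
       }
       if (!ran_any) return false;  // no work left that could finish the future
     }
     return true;
   }

   // An "I/O" task produces a value and schedules a "CPU" task to consume it.
   int RunTwoExecutorDemo() {
     Executor cpu, io;
     std::vector<Executor*> all = {&cpu, &io};
     IntFuture result;
     io.Spawn([&] {
       int bytes = 21;
       cpu.Spawn([&result, bytes] {
         result.value = bytes * 2;
         result.done = true;
       });
     });
     return WaitFor(result, all) ? result.value : -1;
   }
   ```

   Because the wait itself drives the queues, a CPU task that depends on an I/O task cannot deadlock as long as the I/O task is already queued somewhere. The `false` return covers the case where nothing left in any queue can complete the future.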




[GitHub] [arrow] joemarshall commented on issue #35176: [C++] Arrow_threading

Posted by "joemarshall (via GitHub)" <gi...@apache.org>.
joemarshall commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1514814122

   On emscripten - in browser, local disk is memory based, and may or may not be synced to some kind of permanent storage (via an asynchronous syncfs call). Access to this disk is synchronous, but very quick because it is in memory. In node, you can use the real file system directly. 
   
   Network is weird, because it is typically hosted in browsers. For HTTP / HTTPS one can call out to JavaScript to use the Fetch API, which is asynchronous. Right now there's only async I/O for network, with the exception of XMLHttpRequest if you're in a web worker, which is a hacky workaround for synchronous HTTP access. In theory there's also a WebSockets wrapper which turns socket calls in C into WebSocket calls to the hosting server, but I don't know how well it works.
   
   Basically, as I understand it, the potential in emscripten for arrow is:
   
   1) Local file system stuff should just work, if it can be read without threads (I had code reading a parquet file which worked okay)
   
   2) Network things (e.g. reading from S3) would probably require porting work so that they go over HTTP or WebSockets. Anything with a REST API or WebSockets API should be fine. Things that require direct connections or running servers won't work.
   
   3) I think this means that flight is going to be quite limited in its usefulness in webassembly, so I haven't even thought about compiling that.
   
   Personally, for what I want, I just want core arrow with file support to work on emscripten - I think that is a decent starting point before getting into complexities.




[GitHub] [arrow] westonpace commented on issue #35176: [C++] Arrow_threading

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35176:
URL: https://github.com/apache/arrow/issues/35176#issuecomment-1511774877

   I'd say this is non-trivial but should be doable.  Most non-I/O components generally have a way to disable threading.  There are probably some exceptions however and we could clean those up.
   
   I/O, on the other hand, tends to rely on the I/O thread pool being available.  What does I/O look like in emscripten?  I'm thinking of both "local disk I/O" (e.g. read a parquet file from local disk) and network I/O (e.g. read a parquet file from S3)?
   
   Does emscripten have APIs for these things?  Do they have async variants?




[GitHub] [arrow] kou closed issue #35176: [C++] Add support for disabling threading for emscripten

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou closed issue #35176: [C++] Add support for disabling threading for emscripten
URL: https://github.com/apache/arrow/issues/35176

