You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "paleolimbot (via GitHub)" <gi...@apache.org> on 2023/06/26 15:28:43 UTC

[GitHub] [arrow] paleolimbot opened a new pull request, #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

paleolimbot opened a new pull request, #36307:
URL: https://github.com/apache/arrow/pull/36307

   
   ### Rationale for this change
   
   When passing a DuckDB result to Arrow via `to_arrow()` whose input was an Arrow dataset, calls to R code from other threads can occur in some DuckDB operations. This caused a crash or hang on Linux when attempting to combine `pivot_longer()` and `write_dataset()`.
   
   ### What changes are included in this PR?
   
   - Added a wrapper class around the `RecordBatchReader` that routes calls to `ReadNext()` through `SafeCallIntoR()`.
   
   ### Are these changes tested?
   
   TODO!
   
   ### Are there any user-facing changes?
   
   There are no user facing changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic merged pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic merged PR #36307:
URL: https://github.com/apache/arrow/pull/36307


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36307:
URL: https://github.com/apache/arrow/pull/36307#issuecomment-1607724055

   :warning: GitHub issue #35649 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36307:
URL: https://github.com/apache/arrow/pull/36307#issuecomment-1619364385

   Conbench analyzed the 6 benchmark runs on commit `a75299d8`.
   
   There were 9 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-07-02 07:20:16Z](http://conbench.ursa.dev/compare/runs/1e496f0bb8634c33844de1f7fb01daa5...9d00eb1500f14fe4bcd1e0aaf5efd85a/)
     - [params=<SubtractChecked, Int64Type>/size:524288/inverse_null_proportion:0, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a0fc5cf6f7c0d8000539bd918ffad...064a1259fd05769980001409506379f5)
     - [source=cpp-micro, suite=parquet-bloom-filter-benchmark](http://conbench.ursa.dev/compare/benchmarks/064a0fca53b5789180001e3a5a842d40...064a125fc6867e22800029120d734270)
   - and 7 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14753936542) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] paleolimbot commented on pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on PR #36307:
URL: https://github.com/apache/arrow/pull/36307#issuecomment-1609954314

   It's a good thing for all of our sanity!
   
   `SafeCallIntoR()` either evaluates the expression it contains once on the main R thread OR errors it can't. The `RunWithCapturedR()` wrapper launches a thread (via the IO thread pools), sets up a "listener" on the main R thread that waits for requests to evaluate R code from `SafeCallIntoR()`. It's not a perfect demo, but there's some test code that demonstrates this:
   
   https://github.com/apache/arrow/blob/8b4a548baf862c4333fab8868b416fcf30612956/r/src/safe-call-into-r-impl.cpp#L62-L76
   
   The main difference with actual code is that the call to `SafeCallIntoR()` is usually buried deep in some utility code (e.g., read bytes from an R connection by calling the base R `readBin()` function) and the top-level call usually knows nothing about this (e.g., "Read this parquet file").
   
   If `SafeCallIntoR()` is invoked from the R thread, it just evaluates the function without any of that dance. Basically, it is used everywhere we call anything from the R API (as in `#include <R.h>`) or cpp11 (because cpp11 wraps a lot of calls to R.h functions with the assumption that all of this is happening on the main thread).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #36307:
URL: https://github.com/apache/arrow/pull/36307#issuecomment-1609762913

   This all looks good, so will approve shortly, but @paleolimbot, mainly for the sake of my own understanding, would you mind expanding on why routing calls to `ReadNext()` through `SafeCallIntoR()` is needed?  What is `SafeCallIntoR()` doing? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] PMassicotte commented on pull request #36307: GH-35649: [R] Always call `RecordBatchReader::ReadNext()` from DuckDB from the main R thread

Posted by "PMassicotte (via GitHub)" <gi...@apache.org>.
PMassicotte commented on PR #36307:
URL: https://github.com/apache/arrow/pull/36307#issuecomment-1609614284

   Thank you very much @paleolimbot for looking and fixing the issue! Really love the Arrow community!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org