
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 19:15:26 UTC

[GitHub] [arrow] westonpace commented on pull request #10204: [WIP] ARROW-11928: [C++] Execution engine API

westonpace commented on pull request #10204:
URL: https://github.com/apache/arrow/pull/10204#issuecomment-834711745


   This article has an interesting description of how DAGs (which imply multiple outputs) are used by Materialize to optimize query plans: https://scattered-thoughts.net/writing/materialize-decorrelation
   
   I don't know nearly enough to know how common or essential this is.
   
   As for complications, multiple outputs introduce buffering (in both pull and push models).  While you are delivering a result to consumer 1, you have to buffer the result so you can later deliver it to consumer 2.  If your query plan's bottleneck is down the consumer 1 path, results could accumulate in the multicasting operator until you need to trigger backpressure.
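   
   To make the buffering concern concrete, here is a rough sketch (in Python for brevity, not Arrow C++).  The `Multicast` class, its method names, and the `max_buffered` limit are all hypothetical and not part of any proposed API; it only illustrates how one slow consumer forces the operator to hold results and eventually push back on its input.

```python
from collections import deque


class Multicast:
    """Hypothetical fan-out operator: one input, several consumers."""

    def __init__(self, num_consumers, max_buffered):
        # One pending queue per consumer; the limit is an illustrative knob.
        self.queues = [deque() for _ in range(num_consumers)]
        self.max_buffered = max_buffered

    def can_accept(self):
        # Backpressure signal: refuse input while any consumer lags too far.
        return all(len(q) < self.max_buffered for q in self.queues)

    def push(self, batch):
        if not self.can_accept():
            return False  # upstream should pause and retry later
        for q in self.queues:
            q.append(batch)  # buffered until each consumer takes it
        return True

    def pull(self, consumer):
        q = self.queues[consumer]
        return q.popleft() if q else None


op = Multicast(num_consumers=2, max_buffered=3)
accepted = []
for batch in range(5):
    accepted.append(op.push(batch))
    op.pull(0)  # consumer 0 keeps up; consumer 1 (the bottleneck) never pulls
```

   In this sketch the operator stops accepting input once the slow consumer has three batches pending, and accepts again as soon as that consumer pulls one.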
   
   That's the main complication that jumps to mind.  That being said, this "multicasting" is one of the more confusing points of Rx (Reactive).  But that may just come from the dynamic and linear way in which observers are chained.  Since you're already building a graph (that presumably is unchanging for the duration of the execution) that shouldn't be a problem.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org