Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/05 20:05:14 UTC

[GitHub] [arrow] westonpace commented on pull request #14158: ARROW-17762: [C++] WIP: Add ordering information to Acero

westonpace commented on PR #14158:
URL: https://github.com/apache/arrow/pull/14158#issuecomment-1265504491

   > I care about this work very much as well and hope to understand it better. If I remember correctly, the high-level idea is that there are nodes that require ordering (e.g., asof join), and if the input batches are out of order (indicated by batch index), the consumer node will cache/reorder the out-of-order batches before processing them?
   
   Yes.  If a node relies on ordering then it will resequence the batches before processing them.  I try to take care to use "reorder" and "resequence" as distinct terms because they describe two rather different problems.
   
   The first problem is when the input has no known ordering or is in a completely random order.  In that case we must "reorder".  This operation is "not streaming": it is a "pipeline breaker" and requires us to cache all data in memory (or spill) before we can assign the order.
   
   The second problem is when the input is mostly ordered but might be a bit noisy due to something like a parallel scan.  In that case we already have a sequence number and we assume each batch arrives, generally, within some max delta of its correct position.  In that case we only need to "resequence" (not "reorder").  This operation is "mostly streaming" and only sometimes a "pipeline breaker".
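
To make the distinction concrete, here is a minimal Python sketch. It is not the Acero implementation. The names `reorder`, `resequence`, and `max_buffered`, and the `(sequence number, payload)` batch shape, are all hypothetical. Sequence numbers are assumed to be unique and to start at 0, and raising an error when the buffer grows too large is only one possible policy.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical batch shape: (sequence number, payload).
Batch = Tuple[int, Any]


def reorder(rows: Iterable[Any], key: Callable[[Any], Any]) -> List[Any]:
    """Pipeline breaker: nothing can be emitted until all input is seen."""
    buffered = list(rows)  # cache everything (a real engine might spill)
    buffered.sort(key=key)
    return buffered


def resequence(batches: Iterable[Batch], max_buffered: int) -> Iterator[Batch]:
    """Mostly streaming: only early arrivals are held back.

    Assumes sequence numbers are unique and start at 0.
    """
    pending: Dict[int, Any] = {}  # early batches waiting for their turn
    next_seq = 0  # sequence number that must be emitted next
    for seq, payload in batches:
        pending[seq] = payload
        # Emit every batch that is now contiguous with what was emitted.
        while next_seq in pending:
            yield next_seq, pending.pop(next_seq)
            next_seq += 1
        # Assumed policy: fail if the input is noisier than expected.
        if len(pending) > max_buffered:
            raise RuntimeError(
                f"still waiting for batch {next_seq} with "
                f"{len(pending)} batches buffered"
            )
    if pending:
        raise RuntimeError(f"input ended before batch {next_seq} arrived")


if __name__ == "__main__":
    # Batches from a parallel scan, arriving slightly out of order.
    arrivals = [(1, "b"), (0, "a"), (2, "c"), (4, "e"), (3, "d")]
    for seq, payload in resequence(arrivals, max_buffered=2):
        print(seq, payload)
```

In this sketch `resequence` holds at most one batch at a time for the sample input, while `reorder` always holds the entire input. If a single batch arrives very late, the pending buffer keeps growing until it shows up, which is when resequencing behaves like a "pipeline breaker".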

