You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/28 22:04:58 UTC

[GitHub] [arrow] westonpace commented on pull request #13028: ARROW-16083]: [WIP][C++] Implement AsofJoin execution node

westonpace commented on PR #13028:
URL: https://github.com/apache/arrow/pull/13028#issuecomment-1112696020

   Thank you for the contribution.  Yes, there would need to be some significant style changes but it would be a welcome addition.
   
   I'll try and give it a more detailed look tomorrow and play around with it.  At a glance a few thoughts:
   
    * We don't want to use any blocking queues, they will need to be resizable.  The execution engine follows a thread-per-core model so any blocking thread is wasted resources (and a potential deadlock).  Instead, when the queue "limit" is surpassed, we should call PauseProducing on the inputs until the queue is sufficiently drained.
    * What happens if the key column(s) have very many values?  I didn't trace all the paths but I think that means the memo store could get quite large and become a memory pressure concern.  Maybe it is only a concern for malicious inputs and we can just reject the query as invalid.  Long term we could probably investigate just storing the row and not the batch, or, if even that is too large, spilling the memo table to disk.  I don't think any of this is something we need to solve now, just random thoughts.
    * We have some utilities for working with row-major data that we had to come up with for the hash-join.  I don't recall if these are in the current implementation, the [proposed implementation](https://github.com/apache/arrow/pull/12326), or both.  However, I bet we can find some overlap here to share utilities.  I'll try and figure out some suggestions after a detailed look.
    * What approach are you thinking wrt sequencing as discussed on the mailing list?
    * Have you given much thought to what an as-of join looks like in Substrait (for example, the current `keys` option won't work I don't think because it is name based (if I understand the meaning) and not index based)
    * Happy to give some thoughts on threading.  My gut instinct is that threading the join itself won't be critically important for most hardware but I could easily be very wrong on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org