Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/23 01:20:15 UTC

[GitHub] [arrow] westonpace commented on issue #34437: [R] Use FetchNode and OrderByNode

westonpace commented on issue #34437:
URL: https://github.com/apache/arrow/issues/34437#issuecomment-1480452033

   > FetchNode requires order_by first and raises a validation error if ordering is not set, which means a basic SELECT * LIMIT 10 wouldn't work.
   
   Once https://github.com/apache/arrow/issues/34698 merges you should be able to ask the scan to emit data in a deterministic order.  For a dataset scan this will be the order in which files are given to the dataset; if the dataset is created through discovery this is usually (but not necessarily always) the lexicographic order of the filenames.
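
   For reference, here is roughly what the explicit-ordering requirement looks like from the C++ side.  This is only a minimal sketch against the Acero `Declaration` API as it exists in recent Arrow C++ releases (the R bindings would wrap the same nodes), and the `table` variable and `"id"` sort column are just placeholders:

   ```cpp
   // Sketch: satisfy the fetch node's ordering requirement by putting an
   // explicit order_by ahead of it.  `table` and the "id" column are
   // placeholders, not something from the issue above.
   #include <arrow/acero/exec_plan.h>
   #include <arrow/acero/options.h>
   #include <arrow/api.h>
   #include <arrow/compute/ordering.h>

   arrow::Result<std::shared_ptr<arrow::Table>> LimitTenOrdered(
       const std::shared_ptr<arrow::Table>& table) {
     namespace ac = arrow::acero;
     namespace cp = arrow::compute;
     ac::Declaration plan = ac::Declaration::Sequence({
         {"table_source", ac::TableSourceNodeOptions(table)},
         // An explicit ordering gives the fetch node the ordered input it requires
         {"order_by", ac::OrderByNodeOptions(cp::Ordering(
                          {cp::SortKey("id", cp::SortOrder::Ascending)}))},
         // SELECT * ... LIMIT 10  ->  offset 0, count 10
         {"fetch", ac::FetchNodeOptions(/*offset=*/0, /*count=*/10)},
     });
     return ac::DeclarationToTable(std::move(plan));
   }
   ```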
   
   Note that in-memory sources (record batch reader, table, etc.) already declare an implicit ordering.  So if your source is an in-memory table then `SELECT * LIMIT 10` should already work (deterministically) today.
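
   In other words (same caveats as above, minimal sketch only) the in-memory case should not need an explicit order_by at all:

   ```cpp
   // Sketch: fetch directly from an in-memory table.  The table source
   // declares an implicit ordering (the row order of the table), so the
   // fetch node's ordering requirement is already satisfied.
   #include <arrow/acero/exec_plan.h>
   #include <arrow/acero/options.h>
   #include <arrow/api.h>

   arrow::Result<std::shared_ptr<arrow::Table>> LimitTen(
       const std::shared_ptr<arrow::Table>& table) {
     namespace ac = arrow::acero;
     ac::Declaration plan = ac::Declaration::Sequence({
         {"table_source", ac::TableSourceNodeOptions(table)},
         {"fetch", ac::FetchNodeOptions(/*offset=*/0, /*count=*/10)},
     });
     return ac::DeclarationToTable(std::move(plan));
   }
   ```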
   
   The performance penalty for emitting data in a deterministic order is fairly minor, so this should let you run `SELECT * LIMIT 10` deterministically against a dataset scan as well.
   
   However, it still wouldn't support something like `SELECT * FROM left INNER JOIN right ON left.id = right.id LIMIT 10` because the join is going to randomly shuffle the data.
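
   One way to make that kind of query deterministic today would be to re-establish an ordering between the join and the fetch.  A minimal sketch follows; the `lkey` / `rkey` join keys are placeholders, chosen with distinct names so the sort key below is unambiguous:

   ```cpp
   // Sketch: re-impose an ordering after the hash join so the fetch node has
   // deterministic input.  Table variables and column names are placeholders.
   #include <arrow/acero/exec_plan.h>
   #include <arrow/acero/options.h>
   #include <arrow/api.h>
   #include <arrow/compute/ordering.h>

   arrow::Result<std::shared_ptr<arrow::Table>> JoinThenLimitTen(
       const std::shared_ptr<arrow::Table>& left_table,
       const std::shared_ptr<arrow::Table>& right_table) {
     namespace ac = arrow::acero;
     namespace cp = arrow::compute;
     ac::Declaration left{"table_source", ac::TableSourceNodeOptions(left_table)};
     ac::Declaration right{"table_source", ac::TableSourceNodeOptions(right_table)};
     // INNER JOIN ON left.lkey = right.rkey
     ac::Declaration join{"hashjoin",
                          {std::move(left), std::move(right)},
                          ac::HashJoinNodeOptions(ac::JoinType::INNER,
                                                  /*left_keys=*/{"lkey"},
                                                  /*right_keys=*/{"rkey"})};
     ac::Declaration plan = ac::Declaration::Sequence({
         std::move(join),
         // The join output arrives in no particular order, so sort before limiting
         {"order_by", ac::OrderByNodeOptions(cp::Ordering(
                          {cp::SortKey("lkey", cp::SortOrder::Ascending)}))},
         {"fetch", ac::FetchNodeOptions(/*offset=*/0, /*count=*/10)},
     });
     return ac::DeclarationToTable(std::move(plan));
   }
   ```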
   
   If we really want / need to support that non-deterministic case then we can add a boolean flag such as `allow_nondeterministic` (or something like that) to the fetch node options.
   
   > Also, am I correct that select_k_sink_node should be dropped and that there is no replacement select_k_node?
   
   You are correct.  A select-k node is still very doable, but I don't see the point until someone gets around to implementing the more efficient solution.  I think the only reason we had it before is that ordering and limit were both sink nodes, so you couldn't chain them.
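
   So the functional equivalent of select-k is now just the two nodes chained, even though a dedicated node could do it more efficiently.  A minimal sketch, with a placeholder `"score"` column and k = 10:

   ```cpp
   // Sketch: emulate the old select_k ("top k rows by some key") by chaining
   // order_by and fetch.  A dedicated top-k node could do this more
   // efficiently; "score" is a placeholder column name.
   #include <arrow/acero/exec_plan.h>
   #include <arrow/acero/options.h>
   #include <arrow/api.h>
   #include <arrow/compute/ordering.h>

   arrow::Result<std::shared_ptr<arrow::Table>> TopTenByScore(
       const std::shared_ptr<arrow::Table>& table) {
     namespace ac = arrow::acero;
     namespace cp = arrow::compute;
     ac::Declaration plan = ac::Declaration::Sequence({
         {"table_source", ac::TableSourceNodeOptions(table)},
         {"order_by", ac::OrderByNodeOptions(cp::Ordering(
                          {cp::SortKey("score", cp::SortOrder::Descending)}))},
         {"fetch", ac::FetchNodeOptions(/*offset=*/0, /*count=*/10)},
     });
     return ac::DeclarationToTable(std::move(plan));
   }
   ```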
   
   > and that caused the tests to hang.
   
   Yes.  The implementation will have to change slightly when we add a non-deterministic fetch, but it's not too bad: we can just skip the sequencing queue and call process immediately (guarded by a mutex).  Right now it hangs because the sequencing queue simply accumulates everything and never emits anything, since it never sees the first batch.  (It should probably raise an error when it sees an unordered batch instead; that would be a nice cleanup.)
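
   For anyone following along, here is a rough illustration of that control flow.  This is not the actual Arrow source, just a compilable sketch, and it reuses the hypothetical `allow_nondeterministic` flag mentioned above:

   ```cpp
   // Illustrative sketch only (not the real fetch node implementation).
   // Ordered input goes through a sequencing queue so batches are handled in
   // batch-index order; a hypothetical non-deterministic mode could skip the
   // queue and process each batch as it arrives, under a mutex.
   #include <cstdint>
   #include <mutex>
   #include <utility>
   #include <vector>

   struct Batch {
     int64_t index = -1;  // -1 means "no ordering information"
     // ... column data elided ...
   };

   // Stand-in for the real sequencing queue: it holds batches until they can
   // be released in index order.  If index information never arrives, batches
   // accumulate forever -- which is the hang described above.
   struct SequencingQueueSketch {
     std::vector<Batch> held;
     void Insert(Batch batch) { held.push_back(std::move(batch)); }
   };

   struct FetchNodeSketch {
     bool allow_nondeterministic = false;  // hypothetical flag from above
     std::mutex mutex;
     SequencingQueueSketch sequencing_queue;

     void Process(Batch batch) {
       (void)batch;  // apply the offset / limit to this batch
     }

     void InputReceived(Batch batch) {
       if (!allow_nondeterministic) {
         // Ordered path: hold the batch until it can be released in order.
         sequencing_queue.Insert(std::move(batch));
       } else {
         // Non-deterministic path: process immediately, guarded by a mutex.
         std::lock_guard<std::mutex> guard(mutex);
         Process(std::move(batch));
       }
     }
   };
   ```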

