You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 13:13:05 UTC

[GitHub] [arrow-datafusion] yjshen commented on issue #1572: Consolidate the N-way merging code and `SortPreservingMergeStream` (which has quite good tests of what is often quite tricky code, and it will be performance critical)

yjshen commented on issue #1572:
URL: https://github.com/apache/arrow-datafusion/issues/1572#issuecomment-1013680326


   Thanks @alamb for bringing it up.
   
   I propose using heap-sort for N-way merge, but consolidate all the codes we have now in `in_mem_sort` and `SortPreservingMergeStream(SPMS)` into  SPMS, and remove `in_mem_sort`. 
   
   The main reason to use `heap_sort` is to minimize the number of comparisons before emitting each record. Since we have "N" partial ordered parts already, we only need to maintain a heap of size "N" at a time. And benefit from the complexity of O(1) to take out of the smallest record, and O(log n) for new item insertion if the last smallest stream or record batch is not exhausted. 
   
   For the current sorting methods in SPMS, I think it's to compare all starting records from all streams at once for each time we emit the least. The comparison might not be an issue when N is relatively small. But would deteriorate fast as N grows.
   
   @tustvold, what's your opinion on the merging algorithm to choose?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org