You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "zanmato1984 (via GitHub)" <gi...@apache.org> on 2023/12/29 02:59:34 UTC

Re: [I] Left join doesn't preserve order of left for right missing values [arrow]

zanmato1984 commented on issue #37542:
URL: https://github.com/apache/arrow/issues/37542#issuecomment-1871692692

   This is an implementation specific behavior, i.e., depending on the underlying hash join algorithm. Currently there are two types of hash join implementation in acero, both having the similar logic. When doing left outer or full outer joins, the matched rows in the left side are first collected, by probing the hash table built from the right side. Then the unmatched rows in the left side are collected. This results in the order you observed. You can check the code below:
   
   For hash join (the normal algorithm), collecting the unmatched rows:
   https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/hash_join.cc#L440-L445
   
   For swiss join (a variation of hash join), first collecting the matched rows:
   https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L1964-L1998
   then collecting the unmatched rows:
   https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/acero/swiss_join.cc#L2005-L2025
   
   > I know there's no explicit guaranty on the order of the output data on the join, but it's weird that it preserve order when values are present, and changes it when they are missing.
   
   Yes, you shouldn't make any assumption of ordering in SQL except explicitly specified by `order by` clause. Hope the above information explains your concern.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org