Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/24 04:11:53 UTC

[GitHub] [arrow-datafusion] liukun4515 opened a new issue, #4356: refactor the code of the `HashJoin`

liukun4515 opened a new issue, #4356:
URL: https://github.com/apache/arrow-datafusion/issues/4356

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   Bug #4247 exposed bugs in the hash join implementation for right semi and right anti joins.
   
   The current code base has many `match` paths, one for each `JoinType`. Each join type has its own logic and code path, so it is easy to introduce bugs when we add features to `HashJoin`.
   
   
   Proposal:
   
   Split the vectorized `HashJoin` into three phases:
   
   1. Get the result of the equi-join match: `left_idx` and `right_idx`.
   2. Apply the non-equi filter to `left_idx` and `right_idx` to get `filter_left_idx` and `filter_right_idx`.
   3. Construct the result according to the join type.
   ```
   build the result from the matched `filter_left_idx` and `filter_right_idx`
   match join_type {
      inner_join:
      left_join:
      right_join:
      full_join:
      left_semi:  set the left bitmap
      left_anti:  set the left bitmap
      right_semi: set the right bitmap, and get the result from the right bitmap (set bits)
      right_anti: set the right bitmap, and get the result from the right bitmap (unset bits)
   }
   ```
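The three phases above can be sketched in plain Rust as follows. This is only an illustration under simplifying assumptions, not the DataFusion implementation: keys are `i64` slices instead of Arrow arrays, the bitmap is a `Vec<bool>`, only three join types are shown, and the names `JoinKind`, `equal_match`, `apply_filter` and `build_result` are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical subset of join types, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinKind {
    Inner,
    RightSemi,
    RightAnti,
}

/// Phase 1: equi-join match. Returns parallel vectors (left_idx, right_idx).
pub fn equal_match(left_keys: &[i64], right_keys: &[i64]) -> (Vec<usize>, Vec<usize>) {
    // Build side: key -> every left row position holding that key.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, k) in left_keys.iter().enumerate() {
        table.entry(*k).or_default().push(i);
    }
    let (mut left_idx, mut right_idx) = (Vec::new(), Vec::new());
    for (r, k) in right_keys.iter().enumerate() {
        if let Some(rows) = table.get(k) {
            for &l in rows {
                left_idx.push(l);
                right_idx.push(r);
            }
        }
    }
    (left_idx, right_idx)
}

/// Phase 2: keep only the index pairs that pass the non-equi filter.
pub fn apply_filter(
    left_idx: &[usize],
    right_idx: &[usize],
    filter: impl Fn(usize, usize) -> bool,
) -> (Vec<usize>, Vec<usize>) {
    left_idx
        .iter()
        .zip(right_idx)
        .filter(|(l, r)| filter(**l, **r))
        .map(|(l, r)| (*l, *r))
        .unzip()
}

/// Phase 3: build the output rows as (optional left row, right row).
/// The semi/anti variants emit right rows only, chosen from the right bitmap.
pub fn build_result(
    kind: JoinKind,
    filter_left_idx: &[usize],
    filter_right_idx: &[usize],
    right_len: usize,
) -> Vec<(Option<usize>, usize)> {
    match kind {
        JoinKind::Inner => filter_left_idx
            .iter()
            .zip(filter_right_idx)
            .map(|(l, r)| (Some(*l), *r))
            .collect(),
        JoinKind::RightSemi | JoinKind::RightAnti => {
            // Set the right bitmap from the filtered matches.
            let mut bitmap = vec![false; right_len];
            for &r in filter_right_idx {
                bitmap[r] = true;
            }
            // Semi keeps the set bits, anti keeps the unset bits.
            let want = kind == JoinKind::RightSemi;
            (0..right_len)
                .filter(|&r| bitmap[r] == want)
                .map(|r| (None, r))
                .collect()
        }
    }
}
```

In this sketch the join type is consulted only in the last function, so a right row that has an equi match but fails the filter ends up in the right anti output without any special handling in the first two phases.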
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] xudong963 commented on issue #4356: refactor the code of the `HashJoin`

Posted by GitBox <gi...@apache.org>.
xudong963 commented on issue #4356:
URL: https://github.com/apache/arrow-datafusion/issues/4356#issuecomment-1326562094

   > The current code base has many `match` path for each `Join_type`, each `join_type` has different logic and path, it easy to produce the bugs when we add feature in the `HashJoin`.
   
   Yes, I agree.
   
   > split vectorization `HashJoin` to three phase:
   > 
   > 1. get the result of matched equal join : left_idx and right_idx
   > 2. apply non_equal filter to `left_idx and right_idx` and get the filter_left_idx with filter_right_idx
   > 3. according to the `Join Type` to construct the result
   
   For HashJoin, there are two big phases: **build** and **probe**:
   
   1. In the **build** phase, we mostly don't care about the **JoinType**.
   2. In the **probe** phase, the **JoinType** determines the direction. So how about splitting the `match` paths at the beginning of the **probe** phase:
       ```rust
        match join_type {
            inner => probe_inner_join(),
            left => probe_left_join(),
            ....
        }
       ```
        In each probe method, we can process both the equi and the non-equi conditions. The results of the non-equi conditions depend on the **JoinType**.
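A minimal sketch of this alternative, dispatching once at the start of the probe phase, might look like the following. It is not DataFusion code: the build side is assumed to be the left input, keys are plain `i64` values, only inner and right joins are shown, and `JoinKind`, `BuildTable`, `build`, `probe`, `probe_inner_join` and `probe_right_join` are hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical subset of join types, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinKind {
    Inner,
    Right,
}

/// Hash table over the left (build) side: key -> left row positions.
pub type BuildTable = HashMap<i64, Vec<usize>>;

/// Output row: (optional left row, right row).
pub type Row = (Option<usize>, usize);

/// Build phase: identical for every join type.
pub fn build(left_keys: &[i64]) -> BuildTable {
    let mut table = BuildTable::new();
    for (i, k) in left_keys.iter().enumerate() {
        table.entry(*k).or_default().push(i);
    }
    table
}

/// Inner join: a pair that fails the filter is simply dropped.
fn probe_inner_join(
    table: &BuildTable,
    right_keys: &[i64],
    filter: &impl Fn(usize, usize) -> bool,
) -> Vec<Row> {
    let mut out = Vec::new();
    for (r, k) in right_keys.iter().enumerate() {
        if let Some(rows) = table.get(k) {
            for &l in rows {
                if filter(l, r) {
                    out.push((Some(l), r));
                }
            }
        }
    }
    out
}

/// Right join: a right row with no surviving match is still emitted,
/// with the left side missing.
fn probe_right_join(
    table: &BuildTable,
    right_keys: &[i64],
    filter: &impl Fn(usize, usize) -> bool,
) -> Vec<Row> {
    let mut out = Vec::new();
    for (r, k) in right_keys.iter().enumerate() {
        let mut matched = false;
        if let Some(rows) = table.get(k) {
            for &l in rows {
                if filter(l, r) {
                    out.push((Some(l), r));
                    matched = true;
                }
            }
        }
        if !matched {
            out.push((None, r));
        }
    }
    out
}

/// Probe phase: the join type is matched once, up front.
pub fn probe(
    kind: JoinKind,
    table: &BuildTable,
    right_keys: &[i64],
    filter: impl Fn(usize, usize) -> bool,
) -> Vec<Row> {
    match kind {
        JoinKind::Inner => probe_inner_join(table, right_keys, &filter),
        JoinKind::Right => probe_right_join(table, right_keys, &filter),
    }
}
```

The two probe functions share the lookup loop and differ only in what happens when the filter rejects every candidate, which is the sense in which the result of the non-equi condition depends on the join type.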




[GitHub] [arrow-datafusion] liukun4515 commented on issue #4356: refactor the code of the `HashJoin`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #4356:
URL: https://github.com/apache/arrow-datafusion/issues/4356#issuecomment-1327188814

   > > The current code base has many `match` path for each `Join_type`, each `join_type` has different logic and path, it easy to produce the bugs when we add feature in the `HashJoin`.
   > 
   > Yes, I agree.
   > 
   > > split vectorization `HashJoin` to three phase:
   > > 
   > > 1. get the result of matched equal join : left_idx and right_idx
   > > 2. apply non_equal filter to `left_idx and right_idx` and get the filter_left_idx with filter_right_idx
   > > 3. according to the `Join Type` to construct the result
   > 
   > For HashJoin, there are two big phases: **build** and **probe**:
   > 
   > 1. For **build** phase, we don't care **JoinType** almost
   > 2. For **probe** phase, **JoinType** is the direction.  So how about spitting `match` paths at the beginning of **probe** phase
   >    ```rust
   >     match join_type {
   >         inner => probe_inner_join(),
   >         left => probe_left_join(),
   >         ....
   >     }
   >    ```
   >    In each probe method, we can process non-equi conditions and equi conditions. Non-equi conditions's results depend on **JoinType**
   
   The probe phase has many common stages.
   In the vectorized hash join, the first stage is to get the left/right indices that match the `on` join condition.
   
   Next, use the left/right indices to generate the result batch according to the join type. Some join types, however, must maintain a left-side bitmap in order to generate the final result, for example left, full, left anti and left semi.
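The left-side bitmap can be sketched as state that outlives a single probe batch. The code below is an illustration only, assuming `i64` keys, a `Vec<bool>` in place of a real bitmap, and no non-equi filter; `LeftJoinState`, `probe_batch`, `finish` and `JoinKind` are hypothetical names, and the right-side handling that a full join also needs is left out.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the join types that need a left-side bitmap.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JoinKind {
    Left,
    LeftSemi,
    LeftAnti,
}

/// State shared by all probe batches.
pub struct LeftJoinState {
    /// key -> left row positions (build side).
    table: HashMap<i64, Vec<usize>>,
    /// One flag per left row: true once the row has matched any right row.
    visited_left: Vec<bool>,
}

impl LeftJoinState {
    pub fn new(left_keys: &[i64]) -> Self {
        let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
        for (i, k) in left_keys.iter().enumerate() {
            table.entry(*k).or_default().push(i);
        }
        Self {
            table,
            visited_left: vec![false; left_keys.len()],
        }
    }

    /// Common stage: find matching (left_idx, right_idx) pairs for one
    /// right batch and record which left rows were hit.
    pub fn probe_batch(&mut self, right_keys: &[i64]) -> Vec<(usize, usize)> {
        let mut pairs = Vec::new();
        for (r, k) in right_keys.iter().enumerate() {
            if let Some(rows) = self.table.get(k) {
                for &l in rows {
                    self.visited_left[l] = true;
                    pairs.push((l, r));
                }
            }
        }
        pairs
    }

    /// Final stage, after the last right batch: pick left rows from the bitmap.
    /// Left and LeftAnti need the unmatched rows; LeftSemi needs the matched ones.
    pub fn finish(&self, kind: JoinKind) -> Vec<usize> {
        let want_visited = kind == JoinKind::LeftSemi;
        self.visited_left
            .iter()
            .enumerate()
            .filter(|(_, v)| **v == want_visited)
            .map(|(i, _)| i)
            .collect()
    }
}
```

For a left join the pairs returned by `probe_batch` would be emitted per batch and the rows from `finish` appended at the end padded with nulls, whereas left semi and left anti would emit only the rows from `finish`.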




[GitHub] [arrow-datafusion] Dandandan commented on issue #4356: refactor the code of the `HashJoin`

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #4356:
URL: https://github.com/apache/arrow-datafusion/issues/4356#issuecomment-1326703023

   Thanks @liukun4515 for driving the effort to improve the hash join implementation! Makes sense to me to structure it this way. This will help make it easier to change/improve the implementation later on.




[GitHub] [arrow-datafusion] liukun4515 commented on issue #4356: refactor the code of the `HashJoin`

Posted by GitBox <gi...@apache.org>.
liukun4515 commented on issue #4356:
URL: https://github.com/apache/arrow-datafusion/issues/4356#issuecomment-1325934336

   After https://github.com/apache/arrow-datafusion/pull/4355 is merged, I will start this task.




[GitHub] [arrow-datafusion] liukun4515 closed issue #4356: refactor the code of the `HashJoin`

Posted by GitBox <gi...@apache.org>.
liukun4515 closed issue #4356: refactor the code of the `HashJoin`
URL: https://github.com/apache/arrow-datafusion/issues/4356

