Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/07 19:08:00 UTC

[GitHub] [arrow] Dandandan opened a new pull request #9937: ARROW-12279: [Rust][DataFusion] Add test for null handling in hash join (ARROW-12266)

Dandandan opened a new pull request #9937:
URL: https://github.com/apache/arrow/pull/9937


   This PR adds an (ignored) test for https://issues.apache.org/jira/browse/ARROW-12266
   
   ```
   SELECT id1, id2 FROM (SELECT null AS id1) t1
   LEFT JOIN (SELECT 0 AS id2) t2 ON id1 = id2
   ```
   
   current result:
   
   ```NULL, NULL```
   
   (should be empty result set)
   
   We should filter out nulls beforehand to make this result correct. I think the best way to go here is probably to add a non-null filter on the join keys in the logical plan for inner, left, and right joins.
   This can also make things more efficient, as the non-null filter can be pushed down: the data set gets smaller, batches no longer have to carry nullable data, and entire files could even be skipped when they contain only nulls.
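
A minimal Rust sketch of the filter-first idea for an inner join, using plain `Option<i64>` keys rather than Arrow arrays. The names `non_null` and `inner_join` are hypothetical and are not DataFusion APIs. The sketch only shows that once both inputs are filtered to non-null keys, the join itself can rely on ordinary equality.

```rust
use std::collections::HashMap;

/// Drop NULL keys, keeping each surviving key's original row index.
/// Hypothetical helper standing in for a non-null filter in the plan.
fn non_null(keys: &[Option<i64>]) -> Vec<(usize, i64)> {
    keys.iter()
        .enumerate()
        .filter_map(|(row, key)| key.map(|k| (row, k)))
        .collect()
}

/// Inner equi-join over pre-filtered inputs.
/// Returns (left_row, right_row) index pairs in probe order.
fn inner_join(left: &[(usize, i64)], right: &[(usize, i64)]) -> Vec<(usize, usize)> {
    // Build side: map each key to the left rows that carry it.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for &(row, key) in left {
        table.entry(key).or_default().push(row);
    }

    // Probe side: look up each right key and emit all matches.
    let mut out = Vec::new();
    for &(r_row, key) in right {
        if let Some(rows) = table.get(&key) {
            for &l_row in rows {
                out.push((l_row, r_row));
            }
        }
    }
    out
}
```

Because the filter is a separate step, it can in principle be moved closer to the data source independently of the join, which is the push-down benefit described above.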
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb closed pull request #9937: ARROW-12279: [Rust][DataFusion] Add test for null handling in hash join (ARROW-12266)

Posted by GitBox <gi...@apache.org>.
alamb closed pull request #9937:
URL: https://github.com/apache/arrow/pull/9937


   





[GitHub] [arrow] github-actions[bot] commented on pull request #9937: ARROW-12279: [Rust][DataFusion] Add test for null handling in hash join (ARROW-12266)

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #9937:
URL: https://github.com/apache/arrow/pull/9937#issuecomment-815275511


   https://issues.apache.org/jira/browse/ARROW-12279





[GitHub] [arrow] Dandandan commented on pull request #9937: ARROW-12279: [Rust][DataFusion] Add test for null handling in hash join (ARROW-12266)

Posted by GitBox <gi...@apache.org>.
Dandandan commented on pull request #9937:
URL: https://github.com/apache/arrow/pull/9937#issuecomment-815445960


   > > We should filter on nulls beforehand to make this result correct. Probably the best way to go here I think is to add a filter in the logical plan on non-null for inner / left and right joins.
   > 
   > I am not sure this works for all join types (OUTER JOIN, as well as ANTI-JOIN and SEMI-JOIN, which are optimizations for subqueries)
   > 
   > It might make sense to check for null when building the hash table for inner join keys (as NULL will never equal NULL)
   
   You are right, it does not work for outer joins or the other join types (but we don't have them yet). For those, I think the rows have to be included, but some changes might also be needed with respect to equality and building the hashmap. The filter approach is what Spark does, FWIW. I think it also makes sense conceptually, as joins should support other conditions as well, and it allows for greater efficiency.
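
For comparison, the alternative suggested in the quoted comment (checking for NULL while building the hash table) could look roughly like the following for an inner join. This is only a sketch over `Option<i64>` keys; `inner_join_skip_nulls` is a hypothetical name and does not reflect DataFusion's actual hash join code.

```rust
use std::collections::HashMap;

/// Inner equi-join that handles NULL keys inside the join itself.
/// Returns (left_row, right_row) index pairs in probe order.
fn inner_join_skip_nulls(
    left: &[Option<i64>],
    right: &[Option<i64>],
) -> Vec<(usize, usize)> {
    // Build side: a NULL key can never satisfy `=`, so it never enters the table.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, key) in left.iter().enumerate() {
        if let Some(k) = key {
            table.entry(*k).or_default().push(row);
        }
    }

    // Probe side: NULL keys are skipped for the same reason.
    let mut out = Vec::new();
    for (r_row, key) in right.iter().enumerate() {
        if let Some(k) = key {
            if let Some(rows) = table.get(k) {
                for &l_row in rows {
                    out.push((l_row, r_row));
                }
            }
        }
    }
    out
}
```

Here the null handling lives inside the join operator, so it cannot be pushed down separately the way a plan-level filter can.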





[GitHub] [arrow] alamb commented on pull request #9937: ARROW-12279: [Rust][DataFusion] Add test for null handling in hash join (ARROW-12266)

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #9937:
URL: https://github.com/apache/arrow/pull/9937#issuecomment-815288702





