Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/08/27 22:14:53 UTC

[jira] Created: (HIVE-1605) regression and improvements in handling NULLs in joins

regression and improvements in handling NULLs in joins
------------------------------------------------------

                 Key: HIVE-1605
                 URL: https://issues.apache.org/jira/browse/HIVE-1605
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Ning Zhang
            Assignee: Ning Zhang


There are regressions in sort-merge map join after HIVE-741: SMBMapJoinOperator throws many OOM exceptions. They are caused by the HashMap maintained for each key to remember whether it is NULL, which takes too much memory when the tables are large.
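
To illustrate why a per-key NULL flag can exhaust memory, here is a minimal sketch. It is not taken from the Hive source or from the HIVE-1605 patch; the class and method names are hypothetical, and the stateless check is only one possible way to avoid the map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Hive.
final class NullFlagMemory {

    // Remembers a NULL flag for every key seen. The map keeps one entry per
    // distinct key, so its size grows with the number of keys in the table.
    static int cachedFlagEntries(List<List<Object>> keys) {
        Map<List<Object>, Boolean> nullFlags = new HashMap<>();
        for (List<Object> key : keys) {
            nullFlags.put(key, key.contains(null));
        }
        return nullFlags.size();
    }

    // Stateless alternative: recompute the flag for the current key only,
    // so nothing is retained between rows.
    static boolean containsNull(List<Object> key) {
        return key.contains(null);
    }

    public static void main(String[] args) {
        List<List<Object>> keys = new ArrayList<>();
        keys.add(Arrays.<Object>asList(1, "a"));
        keys.add(Arrays.<Object>asList(2, null));
        keys.add(Arrays.<Object>asList(3, "c"));
        System.out.println("entries retained: " + cachedFlagEntries(keys));
        System.out.println("second key has NULL: " + containsNull(keys.get(1)));
    }
}
```

The trade-off in this sketch is a small amount of repeated work per row in exchange for memory use that does not depend on table size.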

A second issue is in handling NULLs when the join key has more than one column. This appears in regular MapJoin as well as SMBMapJoin. The code only checks whether all the columns are NULL. It should return false for a match if any joined value is NULL.
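
The difference between the two checks can be sketched as follows. This is an assumption-laden illustration, not the actual Hive code: the name hasAllNulls appears later in this thread, but its body here, along with hasAnyNulls, keysMatch, and the use of a List for a composite key, is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not copied from Hive or the HIVE-1605 patch.
final class JoinKeyNullCheck {

    // True only when every column of the composite key is NULL.
    static boolean hasAllNulls(List<Object> key) {
        for (Object col : key) {
            if (col != null) {
                return false;
            }
        }
        return true;
    }

    // True when at least one column of the composite key is NULL.
    static boolean hasAnyNulls(List<Object> key) {
        for (Object col : key) {
            if (col == null) {
                return true;
            }
        }
        return false;
    }

    // Equi-join match under SQL semantics: a key containing any NULL
    // never matches, even against an identical key.
    static boolean keysMatch(List<Object> left, List<Object> right) {
        if (hasAnyNulls(left) || hasAnyNulls(right)) {
            return false;
        }
        return left.equals(right);
    }

    public static void main(String[] args) {
        List<Object> a = Arrays.<Object>asList(1, null);
        List<Object> b = Arrays.<Object>asList(1, null);
        // An all-NULL check lets (1, NULL) through, and plain equality
        // then treats the two keys as equal.
        System.out.println("hasAllNulls: " + hasAllNulls(a));
        System.out.println("plain equals: " + a.equals(b));
        // The any-NULL check rejects the pair.
        System.out.println("keysMatch: " + keysMatch(a, b));
    }
}
```

With the key (1, NULL) on both sides, the all-NULL check does not filter the row and plain equality reports a match, whereas the any-NULL check rejects it.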

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904088#action_12904088 ] 

Amareshwari Sriramadasu commented on HIVE-1605:
-----------------------------------------------

MapJoinOperator.java still has a debug log. Otherwise, the patch looks fine.

[jira] Updated: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1605:
-----------------------------

    Status: Patch Available  (was: Open)

[jira] Updated: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1605:
-----------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed. Thanks Ning

[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "John Sichi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904384#action_12904384 ] 

John Sichi commented on HIVE-1605:
----------------------------------

@Namit: I thought we don't need the DROP TABLE anymore since Joy improved the test framework?


[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904085#action_12904085 ] 

Amareshwari Sriramadasu commented on HIVE-1605:
-----------------------------------------------

Ning, thanks for looking into this. A couple of minor comments:
* The patch has many debug logs and commented-out code. Do you want to remove them?
* Do you want to remove the hasAllNulls method from AbstractMapJoinOperator.java?




[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903584#action_12903584 ] 

Ning Zhang commented on HIVE-1605:
----------------------------------

Came up with a patch, along with some other performance improvements in SMBMapJoinOperator. Still running tests.

[jira] Updated: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1605:
-----------------------------

    Attachment: HIVE-1605.patch

Passed all tests except scriptfile1.q in TestMinimrCliDriver on Hadoop 0.20. This test also failed on trunk.

[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904394#action_12904394 ] 

Namit Jain commented on HIVE-1605:
----------------------------------

That's right - the changes look good in that case

[jira] Commented: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904376#action_12904376 ] 

Namit Jain commented on HIVE-1605:
----------------------------------

Ning, can you add the DROP TABLEs at the beginning and end of the test?

[jira] Updated: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1605:
-----------------------------

    Attachment: HIVE-1605.3.patch

Uploading HIVE-1605.3.patch. Thanks, Amareshwari.

[jira] Updated: (HIVE-1605) regression and improvements in handling NULLs in joins

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1605:
-----------------------------

    Attachment: HIVE-1605.2.patch

Thanks, Amareshwari, for the review. The attached HIVE-1605.2.patch addresses the issues.
