Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2011/04/05 23:43:05 UTC

[jira] [Created] (HIVE-2095) auto convert map join should not be triggered if the input size is bigger than a configured value.

auto convert map join should not be triggered if the input size is bigger than a configured value.
--------------------------------------------------------------------------------------------------

                 Key: HIVE-2095
                 URL: https://issues.apache.org/jira/browse/HIVE-2095
             Project: Hive
          Issue Type: Bug
            Reporter: He Yongqiang
            Assignee: He Yongqiang


If auto convert join is set to true, it should fall back to common join if the input size of each join table is bigger than a configured value.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017291#comment-13017291 ] 

He Yongqiang commented on HIVE-2095:
------------------------------------

Uploading a new patch to address Namit's comments.

Note: there is an existing bug in Hive that causes the results of auto_join29.q to be incorrect.
Let's file another JIRA for it.
Basically, if the outer join filter is enabled, the query "SELECT /*+mapjoin(src1, src2)*/ * FROM src src1 RIGHT OUTER JOIN src src2 ON (src1.key = src2.key AND src1.key < 10 AND src2.key > 10) JOIN src src3 ON (src2.key = src3.key AND src3.key < 10) SORT BY src1.key, src1.value, src2.key, src2.value, src3.key, src3.value;" will give wrong results in today's Hive.

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch, HIVE-2095.2.patch
>
>
> 1)
> When considering a table as the big table candidate for a map join, if Hive can determine at compile time that the total known size of all the other tables (excluding the candidate) is bigger than a configured value, the candidate is a bad one and should not be put into the plan. Otherwise, filtering it out at runtime may cost more time.
> 2)
> Added a null check for backup tasks. Otherwise a NullPointerException is thrown.
> 3)
> CommonJoinResolver needs to know the full mapping of pathToAliases. Otherwise it will make the wrong decision.
> 4)
> Changes made to ConditionalResolverCommonJoin: added pathToAliases, aliasToSize (the alias's input size that is known at compile time, from inputSummary), and the intermediate dir path.
> So the logic is: go over all the pathToAliases and, for each path, if it is from the intermediate dir path, add this path's size to all aliases. Finally, choose the big table based on the size information and other data such as aliasToTask.
> 5)
> The conditional task's children contain wrong options, which may cause the join to fail or produce incorrect results. Basically, when getting all possible children for the conditional task, a whitelist of big tables should be used. Only tables in this whitelist can be considered as a big table.
> Here is the logic:
> +   * Get a list of big table candidates. Only the tables in the returned set can
> +   * be used as big table in the join operation.
> +   * 
> +   * The logic here is to scan the join condition array from left to right. If
> +   * see a inner join and the bigTableCandidates is empty, add both side of this
> +   * inner join to big table candidates. If see a left outer join, and the
> +   * bigTableCandidates is empty, add the left side to it, and if the
> +   * bigTableCandidates is not empty, do nothing (which means the
> +   * bigTableCandidates is from left side). If see a right outer join, clear the
> +   * bigTableCandidates, and add right side to the bigTableCandidates, it means
> +   * the right side of a right outer join always win. If see a full outer join,
> +   * return null immediately (no one can be the big table, can not do a
> +   * mapjoin).
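
The scan described in the quoted Javadoc can be sketched roughly as follows. This is only an illustration of the rules as written above, not the code from the patch: JoinType, JoinCond and the class name are hypothetical stand-ins for Hive's own join condition types, and the handling of an inner join when the candidate set is already non-empty is an assumption, since the description does not cover that case.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-ins for Hive's join condition types; not the real classes.
class BigTableCandidatesSketch {

    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // One entry of the join condition array: positions of the left and right tables.
    static final class JoinCond {
        final int left;
        final int right;
        final JoinType type;

        JoinCond(int left, int right, JoinType type) {
            this.left = left;
            this.right = right;
            this.type = type;
        }
    }

    // Returns the table positions that may serve as the big table, or null when
    // a full outer join rules out a map join entirely.
    static Set<Integer> getBigTableCandidates(JoinCond[] conds) {
        Set<Integer> candidates = new LinkedHashSet<>();
        for (JoinCond cond : conds) {
            switch (cond.type) {
                case FULL_OUTER:
                    // No table can be the big table.
                    return null;
                case LEFT_OUTER:
                    // Only the left side qualifies, and only if nothing was chosen yet.
                    if (candidates.isEmpty()) {
                        candidates.add(cond.left);
                    }
                    break;
                case RIGHT_OUTER:
                    // The right side of a right outer join always wins.
                    candidates.clear();
                    candidates.add(cond.right);
                    break;
                case INNER:
                    // Assumption: the description only covers the empty case,
                    // so a non-empty set is left unchanged here.
                    if (candidates.isEmpty()) {
                        candidates.add(cond.left);
                        candidates.add(cond.right);
                    }
                    break;
            }
        }
        return candidates;
    }
}
```

In this sketch, an inner join between positions 0 and 1 followed by a left outer join between 1 and 2 leaves positions 0 and 1 as candidates, while any full outer join in the array yields null.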


[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2095:
-------------------------------

    Status: Patch Available  (was: Open)



[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-2095:
-----------------------------

    Status: Open  (was: Patch Available)



[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-2095:
---------------------------------

          Component/s: Query Processor
    Affects Version/s: 0.7.0
        Fix Version/s: 0.8.0



[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2095:
-------------------------------

    Attachment: HIVE-2095.1.patch



[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "Aniket Mokashi (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated HIVE-2095:
---------------------------------

    Description: 
1)
When considering a table as the big table candidate for a map join, if Hive can determine at compile time that the total known size of all the other tables (excluding the candidate) is bigger than a configured value, the candidate is a bad one and should not be put into the plan. Otherwise, filtering it out at runtime may cost more time.

2)
Added a null check for backup tasks. Otherwise a NullPointerException is thrown.

3)
CommonJoinResolver needs to know the full mapping of pathToAliases. Otherwise it will make the wrong decision.

4)
Changes made to ConditionalResolverCommonJoin: added pathToAliases, aliasToSize (the alias's input size that is known at compile time, from inputSummary), and the intermediate dir path.
So the logic is: go over all the pathToAliases and, for each path, if it is from the intermediate dir path, add this path's size to all aliases. Finally, choose the big table based on the size information and other data such as aliasToTask.

5)
The conditional task's children contain wrong options, which may cause the join to fail or produce incorrect results. Basically, when getting all possible children for the conditional task, a whitelist of big tables should be used. Only tables in this whitelist can be considered as a big table.
Here is the logic:

* Get a list of big table candidates. Only the tables in the returned set can be used as the big table in the join operation.
* The logic here is to scan the join condition array from left to right.
** If we see an inner join and bigTableCandidates is empty, add both sides of this inner join to the big table candidates.
** If we see a left outer join and bigTableCandidates is empty, add the left side to it; if bigTableCandidates is not empty, do nothing (which means the bigTableCandidates come from the left side).
** If we see a right outer join, clear bigTableCandidates and add the right side to it; this means the right side of a right outer join always wins.
** If we see a full outer join, return null immediately (no table can be the big table, so a map join cannot be done).
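
As an illustration of point 1) above, the compile-time check could look something like the sketch below. The class, method and parameter names (BigTableSizeFilterSketch, viableBigTables, smallTablesMaxBytes) are hypothetical and are not taken from the patch. The sketch assumes that tables whose size is unknown at compile time are simply missing from the size map and therefore contribute nothing to the total.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the compile-time filter; names are not from the patch.
class BigTableSizeFilterSketch {

    // Sum of the sizes known at compile time for every alias except the candidate.
    // Aliases with no known size are absent from the map and add nothing.
    static long knownSizeOfOthers(Map<String, Long> aliasToKnownSize, String candidate) {
        long total = 0L;
        for (Map.Entry<String, Long> e : aliasToKnownSize.entrySet()) {
            if (!e.getKey().equals(candidate)) {
                total += e.getValue();
            }
        }
        return total;
    }

    // Keeps only the candidates whose remaining (small) tables stay at or under
    // the configured threshold; the others are dropped from the plan up front.
    static List<String> viableBigTables(List<String> candidates,
                                        Map<String, Long> aliasToKnownSize,
                                        long smallTablesMaxBytes) {
        List<String> viable = new ArrayList<>();
        for (String candidate : candidates) {
            if (knownSizeOfOthers(aliasToKnownSize, candidate) <= smallTablesMaxBytes) {
                viable.add(candidate);
            }
        }
        return viable;
    }
}
```

With known sizes a=100, b=10, c=5 and a threshold of 20, only a survives as a big table candidate in this sketch, because choosing b or c would require loading a as a small table.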



    


[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "Liyin Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017208#comment-13017208 ] 

Liyin Tang commented on HIVE-2095:
----------------------------------

It looks good to me. Thanks, Yongqiang.



[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "Matt Kleiderman (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13449933#comment-13449933 ] 

Matt Kleiderman commented on HIVE-2095:
---------------------------------------

I think I'm hitting this issue with an 0.7.1 installation - can you provide information about how big the tables need to be in order to trigger the NullPointerException?
                


[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2095:
-------------------------------

    Description: 
1)
When considering a table as the big table candidate for a map join, if Hive can determine at compile time that the total known size of all the other tables (excluding the candidate) is bigger than a configured value, the candidate is a bad one and should not be put into the plan. Otherwise, filtering it out at runtime may cost more time.

2)
Added a null check for backup tasks. Otherwise a NullPointerException is thrown.

3)
CommonJoinResolver needs to know the full mapping of pathToAliases. Otherwise it will make the wrong decision.

4)
Changes made to ConditionalResolverCommonJoin: added pathToAliases, aliasToSize (the alias's input size that is known at compile time, from inputSummary), and the intermediate dir path.
So the logic is: go over all the pathToAliases and, for each path, if it is from the intermediate dir path, add this path's size to all aliases. Finally, choose the big table based on the size information and other data such as aliasToTask.

5)
The conditional task's children contain wrong options, which may cause the join to fail or produce incorrect results. Basically, when getting all possible children for the conditional task, a whitelist of big tables should be used. Only tables in this whitelist can be considered as a big table.
Here is the logic:

+   * Get a list of big table candidates. Only the tables in the returned set can
+   * be used as the big table in the join operation.
+   * 
+   * The logic here is to scan the join condition array from left to right. If
+   * we see an inner join and bigTableCandidates is empty, add both sides of
+   * this inner join to the big table candidates. If we see a left outer join
+   * and bigTableCandidates is empty, add the left side to it; if
+   * bigTableCandidates is not empty, do nothing (which means the
+   * candidates come from the left side). If we see a right outer join, clear
+   * bigTableCandidates and add the right side to it, which means the right
+   * side of a right outer join always wins. If we see a full outer join,
+   * return null immediately (no table can be the big table, so a mapjoin
+   * cannot be done).
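
For illustration only, the scan described in the comment above could be sketched roughly as follows. This is not the code from the patch: the JoinType enum, the JoinCond class and the class name are hypothetical stand-ins, and the comment does not say what happens for an inner join when the candidate set is already non-empty, so this sketch assumes nothing is done in that case.

```java
import java.util.HashSet;
import java.util.Set;

class BigTableCandidatesSketch {

  // Hypothetical join types; the names are illustrative, not Hive's constants.
  enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

  // Hypothetical join condition: positions of the left and right tables.
  static final class JoinCond {
    final int left;
    final int right;
    final JoinType type;

    JoinCond(int left, int right, JoinType type) {
      this.left = left;
      this.right = right;
      this.type = type;
    }
  }

  // Scans the join conditions from left to right and returns the positions
  // that may act as the big table, or null if a full outer join is present.
  static Set<Integer> getBigTableCandidates(JoinCond[] conds) {
    Set<Integer> candidates = new HashSet<>();
    for (JoinCond c : conds) {
      switch (c.type) {
        case INNER:
          // Assumption: an inner join only seeds an empty candidate set.
          if (candidates.isEmpty()) {
            candidates.add(c.left);
            candidates.add(c.right);
          }
          break;
        case LEFT_OUTER:
          if (candidates.isEmpty()) {
            candidates.add(c.left);
          }
          break;
        case RIGHT_OUTER:
          // The right side of a right outer join always wins.
          candidates.clear();
          candidates.add(c.right);
          break;
        case FULL_OUTER:
          // No table can be the big table, so no map join is possible.
          return null;
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    // a JOIN b, then RIGHT OUTER JOIN c: only c (position 2) remains.
    JoinCond[] conds = {
      new JoinCond(0, 1, JoinType.INNER),
      new JoinCond(1, 2, JoinType.RIGHT_OUTER)
    };
    System.out.println(getBigTableCandidates(conds));
  }
}
```

A caller would then build conditional task children only for the positions in the returned set, and fall back to the common join when the result is null.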


  was:If auto convert join is set to true, it should fall back to common join if the input size of each join table is bigger than a configured value.

        Summary: auto convert map join bug  (was: auto convert map join should not be triggered if the input size is bigger than a configured value.)
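
As a rough sketch of items 1) and 4) in the description above (not the patch itself): the class name, method names, the pathSizes map and the threshold parameter are all hypothetical, and paths are modeled as plain strings. The first method adds the size of each path under the intermediate dir to every alias mapped to that path; the second rejects a big table candidate when the known sizes of all other tables exceed the configured value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KnownSizeSketch {

  // Item 4: start from the sizes known at compile time and add the size of
  // each intermediate path to all aliases that read from that path.
  // pathSizes is a hypothetical lookup of path sizes.
  static Map<String, Long> addIntermediateSizes(
      Map<String, List<String>> pathToAliases,
      Map<String, Long> aliasToSize,
      Map<String, Long> pathSizes,
      String intermediateDir) {
    Map<String, Long> result = new HashMap<>(aliasToSize);
    for (Map.Entry<String, List<String>> e : pathToAliases.entrySet()) {
      String path = e.getKey();
      if (!path.startsWith(intermediateDir)) {
        continue; // not an intermediate path, size already known or unknown
      }
      long size = pathSizes.getOrDefault(path, 0L);
      for (String alias : e.getValue()) {
        result.merge(alias, size, Long::sum);
      }
    }
    return result;
  }

  // Item 1: a candidate is acceptable only if the total known size of all
  // the other tables is not bigger than the configured threshold.
  static boolean isGoodCandidate(
      String candidate, Map<String, Long> aliasToSize, long threshold) {
    long others = 0L;
    for (Map.Entry<String, Long> e : aliasToSize.entrySet()) {
      if (!e.getKey().equals(candidate)) {
        others += e.getValue();
      }
    }
    return others <= threshold;
  }
}
```

The threshold here stands in for whatever configured value the patch reads; the description does not name the property, so none is assumed.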

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch
>
>

[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "Liyin Tang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016871#comment-13016871 ] 

Liyin Tang commented on HIVE-2095:
----------------------------------

I will take a look

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch
>
>

[jira] [Resolved] (HIVE-2095) auto convert map join bug

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-2095.
------------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]

Committed. Thanks Yongqiang

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch, HIVE-2095.2.patch
>
>

[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017064#comment-13017064 ] 

He Yongqiang commented on HIVE-2095:
------------------------------------

https://reviews.apache.org/r/559/

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch
>
>

[jira] [Updated] (HIVE-2095) auto convert map join bug

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2095:
-------------------------------

    Attachment: HIVE-2095.2.patch

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch, HIVE-2095.2.patch
>
>

[jira] [Commented] (HIVE-2095) auto convert map join bug

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017049#comment-13017049 ] 

Namit Jain commented on HIVE-2095:
----------------------------------

Can you also create a review-board request?

> auto convert map join bug
> -------------------------
>
>                 Key: HIVE-2095
>                 URL: https://issues.apache.org/jira/browse/HIVE-2095
>             Project: Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: HIVE-2095.1.patch
>
>