You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Kevin Wilfong (Created) (JIRA)" <ji...@apache.org> on 2011/12/02 03:14:40 UTC

[jira] [Created] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
-----------------------------------------------------------------------------------------------

                 Key: HIVE-2621
                 URL: https://issues.apache.org/jira/browse/HIVE-2621
             Project: Hive
          Issue Type: New Feature
            Reporter: Kevin Wilfong
            Assignee: Kevin Wilfong


Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.

If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.

To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.

Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175775#comment-13175775 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

heyongqiang has accepted the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch, HIVE-2621.D567.4.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-2621:
------------------------------

    Attachment: HIVE-2621.D567.4.patch

kevinwilfong updated the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".
Reviewers: JIRA

  Addressed Namit's and Yongqiang's comments as follows:

  Removed code for singlemrMultiGroupBy optimization, and all related methods, as the new code should produce similar results and can handle more cases, such as filters.

  Shared code between getCommonDistinctExprs and getCommonGroupByDestGroups, as well as between genCommonGroupByPlanReduceSinkOperator and genGroupByPlanReduceSinkOperator.

  Added comments where requested.

  Deduplicated filters in common filter used before a common group by reduce sink.

REVISION DETAIL
  https://reviews.facebook.net/D567

AFFECTED FILES
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  ql/src/test/results/clientpositive/groupby9.q.out
  ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby10.q.out
  ql/src/test/results/clientpositive/parallel.q.out
  ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/multigroupby_singlemr.q.out
  ql/src/test/results/clientpositive/multi_insert.q.out
  ql/src/test/results/clientpositive/groupby8.q.out
  ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby7_map_multi_single_reducer.q.out
  ql/src/test/queries/clientpositive/groupby7_noskew.q
  ql/src/test/queries/clientpositive/groupby10.q
  ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
  ql/src/test/queries/clientpositive/multigroupby_singlemr.q
  ql/src/test/queries/clientpositive/groupby7_map.q
  ql/src/test/queries/clientpositive/groupby8.q
  ql/src/test/queries/clientpositive/groupby9.q
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby7_map_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
  ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch, HIVE-2621.D567.4.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161371#comment-13161371 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6296 The code in this method should be the same as what followed the code to generate a group by plan in the existing code.  The diff just didn't seem to match them up.
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5693 The string was not being used in this method.

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-2621:
------------------------------

    Attachment: HIVE-2621.D567.1.patch

kevinwilfong requested code review of "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".
Reviewers: JIRA

  The meaningful changes are all in how the plan is generated.

  If the conf variable has been set, the subclauses are first grouped by their group by keys and distinct keys.  To facilitate this I added a wrapper class to ExprNodeDesc which makes equals like the isSame method.

  If the conf variable is not set, I create a single group of all the subqueries.

  Then, provided certain conditions are met, e.g. the conf variable is set, there is a group by and there are aggregations, the skew conf variable hasn't been set, I create the new plan for each group, otherwise the old plan is produced.

  To start I generate the common filter by 'or'ing the group's clauses' filters.  This goes into a select operator, which goes into a new reduce operator.  The reduce operator is like the typical 1 MR group by reduce operator, except that to generate the reduce values it loops over each of the group's subclauses' aggregations and the columns used in the where clauses.

  This goes into a forward operator and for each subclause the forward operator has a child filter operator, if the subclause has a filter, and a group by operator.  Each group by operator is followed by the operators which would normally follow it in a plan.

TEST PLAN
  I added some unit tests.

  I verified these unit tests and the old unit tests all passed.

  I created a sample query which consisted of a multi-insert from a table with 1,000,000 rows, going into 6 tables, each of which's subclause consisted of a group by, and a count distinct, as well as some other aggregations and havings.  The subclauses were constructed such that they could be grouped into two reducers using the new plan.  I also ensured that the data was such that map aggregation was turned of early using the existing plan.  I verified that this query saw a significant improvement in its CPU usage.

REVISION DETAIL
  https://reviews.facebook.net/D567

AFFECTED FILES
  conf/hive-default.xml
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
  ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
  ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

MANAGE HERALD DIFFERENTIAL RULES
  https://reviews.facebook.net/herald/view/differential/

WHY DID I GET THIS EMAIL?
  https://reviews.facebook.net/herald/transcript/1269/

Tip: use the X-Herald-Rules header to filter Herald messages in your client.

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175207#comment-13175207 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

heyongqiang has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3411 can we do some simple de-duplication here?

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Kevin Wilfong (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161369#comment-13161369 ] 

Kevin Wilfong commented on HIVE-2621:
-------------------------------------

diff is here https://reviews.facebook.net/D567
                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2621:
--------------------------------

    Attachment: HIVE-2621.1.patch.txt
    
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Kevin Wilfong (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-2621:
--------------------------------

    Status: Patch Available  (was: Open)
    
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "He Yongqiang (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2621:
-------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

committed, thanks Kevin!
                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch, HIVE-2621.D567.4.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Namit Jain (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174891#comment-13174891 ] 

Namit Jain commented on HIVE-2621:
----------------------------------

Let me take a look at the code again:

But the general flow should be as follows:

if  hive.multigroupby.singlereducer is true (which should always be),
  find common distincts. 
    (or the check hive.multigroupby.singlereducer can be done inside find common distincts function itself)
  if common distincts == null
     old (current) approach - map side aggr should be used
  else:
     new code path

What do you think ? That way, we are guaranteed that the existing behavior is not changed.
This new parameter is only affecting distincts, and we it is very easy to turn it off

I know the code is kind of messy here, but can you spend some time to modularize it,
and reuse as much as possible ?


                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174922#comment-13174922 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

njain has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

  otherwise it looks good

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5925 can you give an example in the comments ?

  Sorry, but it is not clear to me.

  Do you want to return 2 lists - one for the common distincts ?
  I am missing something: what else do you want to return ?


  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3469 thanks for creating this function and re-using this code

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179136#comment-13179136 ] 

Hudson commented on HIVE-2621:
------------------------------

Integrated in Hive-trunk-h0.21 #1182 (See [https://builds.apache.org/job/Hive-trunk-h0.21/1182/])
    HIVE-2621:Allow multiple group bys with the same input data and spray keys to be run on the same reducer. (Kevin via He Yongqiang)

heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1226903
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/groupby10.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_map_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby8.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby9.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
* /hive/trunk/ql/src/test/queries/clientpositive/multigroupby_singlemr.q
* /hive/trunk/ql/src/test/results/clientpositive/groupby10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_map_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby9.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multi_insert.q.out
* /hive/trunk/ql/src/test/results/clientpositive/multigroupby_singlemr.q.out
* /hive/trunk/ql/src/test/results/clientpositive/parallel.q.out

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch, HIVE-2621.D567.4.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175209#comment-13175209 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

heyongqiang has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

  otherwise, looks good to me.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:3015 Instead of a lot duplicate code here, can we just pass one dest to  genGroupByPlanReduceSinkOperator()?

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Kevin Wilfong (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174945#comment-13174945 ] 

Kevin Wilfong commented on HIVE-2621:
-------------------------------------

There are currently two ways of getting common distincts, the current way checks that all distinct expressions in the subqueries are the same.  My new code doesn't depend on this, it tries to construct subsets of the subqueries such that this is true for each subset.

The advantage of doing it in the form
if (optimizeMultiGroupBy) {
  ...
} else {
  <group queries by common distinct and group by expressions>
  for each group:
    if (size of group > 1 && etc.) {
      <new code>
    } else {
      <old code>
    }
}

is that the block of code inside the optimizeMultiGroupBy if statement can produce 2 map reduce jobs where the new code might produce many.

After looking at it more carefully, I can get rid of the singlemrMultiGroupBy if statement and the code within the block because it produces the same result that my new code would except that the new code can handle filters as well.

After removing that code, the only remaining code above the if statement will be the poorly named getCommonDistinctExprs (as it only returns the common distinct expressions provided a lot of conditions are met including a requirement that all the distinct expressions are common), which I should be able to modify to use my new code.
                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174671#comment-13174671 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

njain has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 This does not look right.

  We would like to make hive.multigroupby.singlereducer as true by default.

  But, we are un-necessarily generating 3 MR jobs for this query (with no distinct). I think, we can get it in 2 MR jobs today (not 100% sure)
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273 It would be good to merge the code path with the above if block

  (optimizeMultiGroupBy).

  The common distinct expression should return the common distinct
  checking for the parameter HIVEMULTIGROUPBYSINGLEREDUCER.

  Or, it might be simpler to remove the above if block (the optimizeMultiGroupby should be covered by this block).
  Anyway, the above if block (6253-6272) seems broken
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211 I think this code can be simplified.

  The function getCommonDistinctExprs can be removed

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174927#comment-13174927 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:5925 I return a list of lists of clause names (which map to subqueries) where the queries mapped to by each clause name in a list all have the same distinct and group by keys.

  It doesn't return common distincts.

  I'll try to make the comment clearer.

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174730#comment-13174730 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

kevinwilfong has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 I actually had it in one MR job originally, but the addition of the limits caused the addition of an extra job for each insert.  The code to do that is at line 6379 of the SemanticAnalyzer.  I assume it does this to ensure all the rows go to a single reducer, so the limit can be enforced.

  As for making hive.multigroupby.singlereducer true by default, currently map aggregation is set to true by default, and setting this new variable causes the map aggregation variable to be ignored for any group by that can go to the same reducer as at least one other.

  I'll turn it on by default, I just wanted to make that clear.

  Also, in that case this test isn't needed as I just copied another test and turned on hive.multigroupby.singlereducer
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6211 This code looks at all the group bys in the query block and allows for only some of the distinct expressions and group by keys to be the same.

  The new code looks for sets of group by subqueries where all of the distinct expressions and group by keys are the same.

  I'll go back and see if I reused as much of the definitions of the two functions getCommonDistinctExprs and getCommonGroupByKeys as I could have, but I don't think this code here could be simplified using my new code.
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6273 The above if block guarantees the job can be performed using only 2 MR jobs provided that there are no filter operators, and the distinct keys are the same across all subqueries.

  If my new code were used, there is no hard limit on the number of MR jobs because it depends on how many variations there are on the group by keys.

  I think it is possible that we could first look at the set of subqueries without filters, see how they split on both distinct and group by keys, and for any of these groups where two or more have the same distinct keys run this other method on them.  However, this could have been done much more easily before (without the comparison, just group queries with the same distinct keys together), and it wasn't, so I suspect the gains are not that great, so I'd much rather do this as part of a separate patch once we have queries that would benefit from it.

  I'm not sure whether or not the code is broken, but I would like to leave it for now, to be fixed, or modified as I described above as part of a separate patch.

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175201#comment-13175201 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

heyongqiang has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

  can u add a testcase which includes a subquery in one group by clause?
   still reviewing


INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6290 the check here is confusing. It is not very clear which cases should come in, and which cases should not. Can u try to reduce the size of check list by moving some up?

  or add some comments here

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-2621:
------------------------------

    Attachment: HIVE-2621.D567.2.patch

kevinwilfong updated the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".
Reviewers: JIRA

  Updated to avoid merge conflicts.

REVISION DETAIL
  https://reviews.facebook.net/D567

AFFECTED FILES
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
  ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
  ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175205#comment-13175205 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

heyongqiang has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java:6284 can u add here comments that you just explained to me offline?

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173907#comment-13173907 ] 

Phabricator commented on HIVE-2621:
-----------------------------------

njain has commented on the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".

INLINE COMMENTS
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q:12 A general comment for all the tests.

  Instead of loading 500 rows to src - create a new table - load some 10 rows of src into it, and use that for all tests.

  The test outputs are really long, and difficult to review

REVISION DETAIL
  https://reviews.facebook.net/D567

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.

Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phabricator updated HIVE-2621:
------------------------------

    Attachment: HIVE-2621.D567.3.patch

kevinwilfong updated the revision "HIVE-2621 [jira] Allow multiple group bys with the same input data and spray keys to be run on the same reducer.".
Reviewers: JIRA

  Updated the diff again to prevent conflicts.

  Added limits in the test cases to prevent the output from getting too long.

REVISION DETAIL
  https://reviews.facebook.net/D567

AFFECTED FILES
  common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
  ql/src/test/results/clientpositive/groupby7_noskew_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_multi_single_reducer.q.out
  ql/src/test/results/clientpositive/groupby_complex_types_multi_single_reducer.q.out
  ql/src/test/queries/clientpositive/groupby_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby7_noskew_multi_single_reducer.q
  ql/src/test/queries/clientpositive/groupby_complex_types_multi_single_reducer.q
  ql/src/java/org/apache/hadoop/hive/ql/plan/ExprNodeDesc.java
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java

                
> Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2621
>                 URL: https://issues.apache.org/jira/browse/HIVE-2621
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch, HIVE-2621.D567.3.patch
>
>
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause consists of a simple query followed by a group by, the group bys for each clause are run on a separate reducer.  This requires writing the data for each group by clause to an intermediate file, and then reading it back.  This uses a significant amount of the total CPU consumed by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all of the group by expressions for a group of subclauses run on a single reducer, this would reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute a the filters for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce sink.  In the reducer, the child operators would be each subclauses filter followed by the group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using this type of plan configurable.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira