Posted to common-dev@hadoop.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2008/09/10 01:47:44 UTC

[jira] Created: (HADOOP-4139) [Hive] multi group by statement is not optimized

[Hive] multi group by statement is not optimized
------------------------------------------------

                 Key: HADOOP-4139
                 URL: https://issues.apache.org/jira/browse/HADOOP-4139
             Project: Hadoop Core
          Issue Type: Bug
            Reporter: Namit Jain
            Assignee: Namit Jain


A simple multi-group by statement is not optimized. A simple statement like:

FROM SRC
INSERT OVERWRITE TABLE DEST1 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key
INSERT OVERWRITE TABLE DEST2 SELECT SRC.key, count(distinct SUBSTR(SRC.value,4)) GROUP BY SRC.key;


results in two copies of the data (SRC) being made. Instead, the data can first be partially aggregated on the distinct value and then aggregated again.
This first step can be shared by all the group bys.
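
As a rough illustration of the idea (not the Hive implementation), the two-step aggregation can be sketched in Python. The function names and the sample rows are hypothetical; the only thing taken from the statement above is the shape of the query, with one shared first stage feeding both inserts.

```python
from collections import defaultdict


def partial_aggregate(rows):
    """Stage 1 (shared): reduce SRC to the distinct (key, SUBSTR(value, 4)) pairs."""
    # Hive's SUBSTR is 1-based, so SUBSTR(value, 4) corresponds to value[3:] here.
    return {(key, value[3:]) for key, value in rows}


def count_distinct_per_key(pairs):
    """Stage 2 (one per INSERT): count the distinct values left for each key."""
    counts = defaultdict(int)
    for key, _ in pairs:
        counts[key] += 1
    return dict(counts)


# Hypothetical sample rows standing in for SRC(key, value).
src = [("a", "val_1"), ("a", "val_1"), ("a", "val_2"), ("b", "val_1")]

shared = partial_aggregate(src)          # SRC is read once
dest1 = count_distinct_per_key(shared)   # would feed DEST1
dest2 = count_distinct_per_key(shared)   # would feed DEST2
```

Both destinations are computed from the same intermediate result, so the source rows are read only once.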

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630255#action_12630255 ] 

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

+1 

Looks good to me.


[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630726#action_12630726 ] 

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

+1



[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630055#action_12630055 ] 

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

Namit and I went over this. The following are the comments:

1. In OpForward, instead of copying the input row resolver, we could reuse the same one.
2. In the first job we evaluate all the input columns, all group by clause expressions, and the parameters to all the aggregation functions. Not all duplicates are eliminated, because we treat expression resolution and column resolution differently (to be fixed in a later txn).
3. We can be smarter about which parameters we evaluate in the first stage: we should evaluate only those that are common across the group by clauses (to be fixed in a later txn).
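
To make points 2 and 3 concrete, here is a small Python sketch of the difference between evaluating every expression in the first stage and evaluating only the common ones. It is an illustration under assumptions, not Hive code: expressions are compared as plain strings, and the function name and sample clauses are hypothetical.

```python
def stage_one_expressions(clauses):
    """Return (deduplicated union, common subset) of the per-clause expressions."""
    union = []
    for exprs in clauses:
        for expr in exprs:
            if expr not in union:   # drop duplicates, keep first-seen order
                union.append(expr)
    # Expressions that every group by clause needs.
    common = [expr for expr in union if all(expr in exprs for exprs in clauses)]
    return union, common


# Hypothetical expression lists for two group by clauses.
clauses = [
    ["SRC.key", "SUBSTR(SRC.value,4)"],
    ["SRC.key", "SUBSTR(SRC.value,4)", "SRC.value"],
]
union, common = stage_one_expressions(clauses)
```

Here `union` is what a duplicate-free version of the current behaviour would evaluate, and `common` is the smaller set suggested in point 3.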



[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633308#action_12633308 ] 

Hudson commented on HADOOP-4139:
--------------------------------

Integrated in Hadoop-trunk #611 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/])


[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630058#action_12630058 ] 

Namit Jain commented on HADOOP-4139:
------------------------------------

Done with 1. I am filing separate JIRAs for 2 and 3.


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Hadoop Flags: [Reviewed]
          Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Attachment: patch1

I verified the plan. Can someone from Hive comment?


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Status: Open  (was: Patch Available)


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Attachment: patch3


[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629861#action_12629861 ] 

Namit Jain commented on HADOOP-4139:
------------------------------------

1. I will change the test.
2. The check is done before the plan is generated.
3. I will write comments describing the algorithm.



[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Status: Patch Available  (was: Open)


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Component/s: contrib/hive


[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629845#action_12629845 ] 

Ashish Thusoo commented on HADOOP-4139:
---------------------------------------

I should be done reviewing this in a couple of hours.

A few minor comments though:

1. In the tests we should drop the created destination tables. At some point we want to ensure that the cleanup code for a test is isolated within the test. (This is minor; I am OK with it as is for now.)
2. The check to disallow different distincts: can that be moved up, potentially even before we generate the groupbyPlan? There is no point going through the entire processing if we can disallow it right up front.
3. Also, a comment describing the algorithm somewhere would be great.
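
The up-front check in point 2 could look roughly like the Python sketch below. This is only an illustration of the "fail before planning" idea; the function name, the error message, and the string comparison of expressions are assumptions and do not reflect the actual Hive code.

```python
def check_common_distinct(distinct_exprs_per_insert):
    """Reject the query early unless all INSERT clauses share one DISTINCT expression.

    Each element of the argument is the list of DISTINCT expressions (as strings)
    used by one INSERT clause. Returns the shared expression, or None if there is none.
    """
    seen = {expr for exprs in distinct_exprs_per_insert for expr in exprs}
    if len(seen) > 1:
        # Fail before any plan generation work is done.
        raise ValueError(f"different distinct expressions are not supported: {sorted(seen)}")
    return next(iter(seen), None)


# Usage: both inserts use the same DISTINCT expression, so the check passes.
shared_expr = check_common_distinct([["SUBSTR(SRC.value,4)"], ["SUBSTR(SRC.value,4)"]])
```

Running the check first means a rejected query costs one pass over the parsed clauses rather than a full plan generation.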



[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631306#action_12631306 ] 

Hadoop QA commented on HADOOP-4139:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390042/patch4.txt
  against trunk revision 695690.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 18 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3269/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3269/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3269/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3269/console

This message is automatically generated.


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4139:
-------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.19.0
           Status: Resolved  (was: Patch Available)

I just committed this. Thanks Namit!


[jira] Updated: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HADOOP-4139:
-------------------------------

    Attachment: patch4.txt

I generated the file from the wrong directory. It should be fine now.
Ashish, can you accept again?


[jira] Commented: (HADOOP-4139) [Hive] multi group by statement is not optimized

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630479#action_12630479 ] 

Hadoop QA commented on HADOOP-4139:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389888/patch3
  against trunk revision 694562.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 20 new or modified tests.

    -1 patch.  The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3254/console

This message is automatically generated.
