
Posted to dev@hive.apache.org by "Siying Dong (JIRA)" <ji...@apache.org> on 2011/05/03 03:27:03 UTC

[jira] [Created] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Block Sampling should adjust number of reducers accordingly to make it useful
-----------------------------------------------------------------------------

                 Key: HIVE-2146
                 URL: https://issues.apache.org/jira/browse/HIVE-2146
             Project: Hive
          Issue Type: Bug
            Reporter: Siying Dong


Currently the number of reducers is not adjusted for block sampling, so queries like:
select c from tab tablesample(1 percent) group by c;
can launch a huge number of reducers even though the sampled input is small.
We need to shrink the number of reducers to make block sampling more useful.
Because the number of reducers is currently determined before the splits are computed, the way to do it is probably not very clean, but we can make a good guess.
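
As a rough illustration of the idea (not the code in the attached patch), the reducer estimate could be scaled by the sampling percentage before it is divided by the bytes-per-reducer target. The class, method, and parameter names below are hypothetical, and the 1 GB per-reducer target and the cap of 999 reducers used in the usage note are assumed values chosen only for the example.

```java
// Hypothetical sketch: estimate reducers from the sampled input size
// instead of the full table size. Not taken from the HIVE-2146 patch.
class SampledReducerEstimate {

    /**
     * @param totalInputBytes size of the unsampled input in bytes
     * @param samplePercent   percentage given in tablesample(N percent)
     * @param bytesPerReducer target amount of input per reducer (assumed setting)
     * @param maxReducers     upper bound on the number of reducers (assumed setting)
     */
    static int estimateReducers(long totalInputBytes, double samplePercent,
                                long bytesPerReducer, int maxReducers) {
        // Guess the sampled size; the real splits are not known yet at this point.
        long sampledBytes = (long) (totalInputBytes * samplePercent / 100.0);
        // Ceiling division: round up so a partial chunk still gets a reducer.
        long reducers = (sampledBytes + bytesPerReducer - 1) / bytesPerReducer;
        // Always run at least one reducer and never exceed the cap.
        reducers = Math.max(1, reducers);
        return (int) Math.min(maxReducers, reducers);
    }
}
```

With a 100 GB input, a 1 GB per-reducer target, and a cap of 999, this sketch yields 100 reducers for the unsampled table and 1 reducer for tablesample(1 percent).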

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Attachment: HIVE-2146.1.patch


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028564#comment-13028564 ] 

jiraposter@reviews.apache.org commented on HIVE-2146:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/685/
-----------------------------------------------------------

Review request for hive, Ning Zhang and namit jain.


Summary
-------

Currently the number of reducers is not adjusted for block sampling, so queries like:
select c from tab tablesample(1 percent) group by c;
can launch a huge number of reducers even though the sampled input is small.
We need to shrink the number of reducers to make block sampling more useful.
Because the number of reducers is currently determined before the splits are computed, the way to do it is probably not very clean, but we can make a good guess.


This addresses bug HIVE-2146.
    https://issues.apache.org/jira/browse/HIVE-2146


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 1098885 

Diff: https://reviews.apache.org/r/685/diff


Testing
-------


Thanks,

Siying




[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Status: Patch Available  (was: Open)


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Attachment: HIVE-2146.2.patch


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-2146:
-----------------------------

    Status: Open  (was: Patch Available)


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Attachment: HIVE-2146.1.patch


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-2146:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks Siying!


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13029816#comment-13029816 ] 

Hudson commented on HIVE-2146:
------------------------------

Integrated in Hive-trunk-h0.20 #712 (See [https://builds.apache.org/hudson/job/Hive-trunk-h0.20/712/])
    


[jira] [Assigned] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong reassigned HIVE-2146:
---------------------------------

    Assignee: Siying Dong


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028624#comment-13028624 ] 

Ning Zhang commented on HIVE-2146:
----------------------------------

+1. Will commit if tests pass.


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Attachment:     (was: HIVE-2146.1.patch)


[jira] [Updated] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siying Dong updated HIVE-2146:
------------------------------

    Status: Patch Available  (was: In Progress)


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028556#comment-13028556 ] 

Siying Dong commented on HIVE-2146:
-----------------------------------

For 2), the most likely case in which the input can't be sampled is that CombineHiveInputFormat.getSplits() ends up calling super.getSplits() for some reason. In those cases, the data is not sampled at all.
Another possibility is that, for example, two aliases of the MapReduce job include the same directory. We can't sample it then.

For 1) and 3), I thought about it more. I'll remove the extra bytesPerReducer that was added. The worst case is that we run one fewer reducer, which shouldn't be so bad.
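
To illustrate the arithmetic behind the "one fewer reducer" remark, here is a small sketch that compares a ceiling-division estimate with and without an extra bytesPerReducer of padding on the sampled size. It is not the patch code; the class and method names are hypothetical and only show why the two estimates differ by exactly one.

```java
// Hypothetical illustration of the padding discussed above; not from the patch.
class PaddingComparison {

    // Ceiling division, in the style of a size-based reducer estimate.
    static long ceilDiv(long bytes, long bytesPerReducer) {
        return (bytes + bytesPerReducer - 1) / bytesPerReducer;
    }

    // Estimate with one extra bytesPerReducer added to the sampled size.
    static long withExtraPadding(long sampledBytes, long bytesPerReducer) {
        return ceilDiv(sampledBytes + bytesPerReducer, bytesPerReducer);
    }

    // Estimate after the extra padding is removed.
    static long withoutExtraPadding(long sampledBytes, long bytesPerReducer) {
        return ceilDiv(sampledBytes, bytesPerReducer);
    }
}
```

For a sampled size of 2.5 GB and a 1 GB target (assumed numbers), the padded estimate is 4 reducers and the unpadded one is 3, so dropping the padding costs at most one reducer.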


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028563#comment-13028563 ] 

Siying Dong commented on HIVE-2146:
-----------------------------------

review board:

https://reviews.apache.org/r/685/


[jira] [Commented] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13028532#comment-13028532 ] 

Ning Zhang commented on HIVE-2146:
----------------------------------

Siying, can you create a review request?

I have a couple of comments as well:
 1) The comment in line 387 is not complete.
 2) Comments in lines 388-389: can you give an example of a case in which all input aliases are sampled syntactically but the input cannot actually be sampled?
 3) Line 391: if we want to mimic the old estimation algorithm, it seems we shouldn't add bytesPerReducer here. It is already added in line 399, right?


[jira] [Work started] (HIVE-2146) Block Sampling should adjust number of reducers accordingly to make it useful

Posted by "Siying Dong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-2146 started by Siying Dong.
