
Posted to dev@chukwa.apache.org by "Jie Huang (JIRA)" <ji...@apache.org> on 2012/07/16 07:29:33 UTC

[jira] [Created] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Jie Huang created CHUKWA-647:
--------------------------------

             Summary: Spread out intermediate data with the same ReduceType into different Reduce Tasks
                 Key: CHUKWA-647
                 URL: https://issues.apache.org/jira/browse/CHUKWA-647
             Project: Chukwa
          Issue Type: Improvement
          Components: Data Processors
    Affects Versions: 0.4.0, 0.6.0
            Reporter: Jie Huang
            Priority: Minor


We have found that partitioning the map output data by ReduceType alone causes data skew in some HiTune cases. One or two Reduce Tasks then have to process more input data than the others, and they slow down the whole Demux job.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Jie Huang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Huang updated CHUKWA-647:
-----------------------------

    Attachment: Chukwa-647.patch

Attached is a simple workaround: if the key contains a specific mark, the partitioner includes the key as well as the ReduceType. Another option is to include part of the key content. Any other ideas?
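
For illustration only, the workaround described above might look roughly like the sketch below. This is not the attached patch: the class name, the SPREAD_MARK value, and the "/" separator are hypothetical, and plain strings stand in for the Chukwa key type so that the example runs without Hadoop on the classpath.

```java
public class MarkedKeyPartitionSketch {
    // Hypothetical marker value; the real patch may use a different mark or check.
    public static final String SPREAD_MARK = "SPREAD:";

    // Picks a reducer index in [0, numReduceTasks).
    // Keys without the mark are partitioned by ReduceType only, as before.
    // Keys carrying the mark also contribute to the hash, so records of one
    // ReduceType can land on different reducers.
    public static int getPartition(String reduceType, String key, int numReduceTasks) {
        String hashSource = reduceType;
        if (key != null && key.contains(SPREAD_MARK)) {
            hashSource = reduceType + "/" + key;
        }
        // Masking with Integer.MAX_VALUE clears the sign bit so the modulo is non-negative.
        return (hashSource.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 8;
        // Unmarked keys: same ReduceType, same partition.
        System.out.println(getPartition("HiTune", "host1/1342416573000", reducers));
        System.out.println(getPartition("HiTune", "host2/1342416573000", reducers));
        // Marked keys: the key takes part in the hash.
        System.out.println(getPartition("HiTune", "SPREAD:host1/1342416573000", reducers));
        System.out.println(getPartition("HiTune", "SPREAD:host2/1342416573000", reducers));
    }
}
```

Making the behaviour opt-in through a mark keeps the existing partitioning for any reduce-side processor that expects to see every record of its ReduceType in one task.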
                


[jira] [Updated] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Jie Huang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Huang updated CHUKWA-647:
-----------------------------

    Attachment:     (was: Chukwa-647.patch)
    


[jira] [Commented] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415678#comment-13415678 ] 

Ari Rabkin commented on CHUKWA-647:
-----------------------------------

Looks good to me. Will commit to Trunk barring objections.

(My sense is that we aren't going to be doing minor-version releases, so it doesn't make sense to apply this to the 0.4 or 0.5 branches.)
                


[jira] [Commented] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Jie Huang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414878#comment-13414878 ] 

Jie Huang commented on CHUKWA-647:
----------------------------------

The current ChukwaRecordPartitioner dispatches the records to different Reduce Tasks based on ReduceType. 
{noformat}
return (key.getReduceType().hashCode() & Integer.MAX_VALUE)
{noformat}
Would it be possible to have ChukwaRecordPartitioner include the key, or part of the key content, in the hash, so that the map output is spread across different Reduce Tasks even within the same ReduceType?
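
To make the skew concrete, here is a small self-contained sketch that counts how many records each reducer would receive under the two schemes. It is an illustration under assumptions, not Chukwa code: the class and method names, the synthetic "record-N" keys, and the way the two hash codes are combined are all hypothetical.

```java
import java.util.Arrays;

public class PartitionSkewSketch {
    // ReduceType-only scheme: every record of one type maps to one reducer.
    public static int byReduceType(String reduceType, int numReduceTasks) {
        return (reduceType.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical alternative: mix the key's hash into the ReduceType's hash.
    // Records with identical (reduceType, key) still map to the same reducer.
    public static int byReduceTypeAndKey(String reduceType, String key, int numReduceTasks) {
        int h = 31 * reduceType.hashCode() + key.hashCode();
        return (h & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts records per reducer for synthetic keys "record-0", "record-1", ...
    // that all share the ReduceType "HiTune".
    public static int[] countPerReducer(boolean includeKey, int records, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (int i = 0; i < records; i++) {
            String key = "record-" + i;
            int p = includeKey
                    ? byReduceTypeAndKey("HiTune", key, numReduceTasks)
                    : byReduceType("HiTune", numReduceTasks);
            counts[p]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println("ReduceType only:  " + Arrays.toString(countPerReducer(false, 1000, 4)));
        System.out.println("ReduceType + key: " + Arrays.toString(countPerReducer(true, 1000, 4)));
    }
}
```

With the ReduceType-only scheme all 1000 records fall into a single bucket, which is the skew described in this issue. How evenly the second scheme spreads them depends on the real key distribution.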

                


[jira] [Resolved] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Ari Rabkin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ari Rabkin resolved CHUKWA-647.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.6.0
         Assignee: Ari Rabkin

I just committed this to Trunk. Thanks!

NOTE: I made some slight changes to the patch so that it applies correctly to Trunk.
                


[jira] [Updated] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Jie Huang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Huang updated CHUKWA-647:
-----------------------------

    Attachment: Chukwa-647-0_4.patch
    


[jira] [Commented] (CHUKWA-647) Spread out intermediate data with the same ReduceType into different Reduce Tasks

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/CHUKWA-647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415820#comment-13415820 ] 

Hudson commented on CHUKWA-647:
-------------------------------

Integrated in Chukwa-trunk #453 (See [https://builds.apache.org/job/Chukwa-trunk/453/])
    CHUKWA-647. Spread out intermediate data with the same ReduceType into different Reduce Tasks. Contributed by Jie Huang. (Revision 1362318)

     Result = FAILURE
asrabkin : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1362318
Files : 
* /incubator/chukwa/trunk/src/main/java/org/apache/hadoop/chukwa/extraction/CHUKWA_CONSTANT.java
* /incubator/chukwa/trunk/src/main/java/org/apache/hadoop/chukwa/extraction/demux/ChukwaRecordPartitioner.java

                
