Posted to common-dev@hadoop.apache.org by "Runping Qi (JIRA)" <ji...@apache.org> on 2008/03/21 06:31:25 UTC

[jira] Created: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
------------------------------------------------------------------------------------------------------------------------------------

                 Key: HADOOP-3062
                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
             Project: Hadoop Core
          Issue Type: Improvement
          Components: metrics
            Reporter: Runping Qi



To better understand the relationship between Hadoop performance and network bandwidth, we need to know 
the aggregate network traffic in a cluster and its breakdown by rack. With this data, we can determine whether network 
bandwidth is the bottleneck when certain jobs are running on a cluster.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-0.patch

First draft.

Format:
{noformat}
<log4j schema including timestamp, etc.> src: <src IP>, dest: <dst IP>, bytes: <bytes>, op: <op enum>, id: <DFSClient id|taskid>[, blockid: <block id>] 
{noformat}

The patch adds the DFSClient clientName to OP_READ_BLOCK and changes the String in OP_WRITE_BLOCK from the path (which is unused) to the clientName. If this is set to DFSClient_<taskid> in map and reduce tasks, tracing the output of a job should be straightforward after some processing of each entry. Writes for replications (where the clientName is "") are logged as they have been; the logging in PacketResponder has been reformatted to fit the preceding schema. A few known issues:

* The logging assumes the IP address is sufficient to distinguish a source, particularly for writes and in the shuffle
* This logs to the DataNode and ReduceTask appenders; these entries should be directed elsewhere and disabled by default
* In testing this, some entries in the read exhibited a strange property: the source and destination match, but neither matches the DataNode on which it is logged. I'm clearly missing something.

I traced a few blocks and map outputs through the logs, and those all made sense. That said, as mentioned in the last bullet, not every entry did.
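
For anyone post-processing these entries, a minimal parsing sketch in Java follows. It assumes only the key/value tail shown in the format above and ignores whatever log4j prefix precedes "src:". The class names TraceEntry and ClientTraceParser are hypothetical and not part of the patch.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** One transfer record; fields mirror the keys of the proposed format. */
final class TraceEntry {
    final String src;
    final String dest;
    final long bytes;
    final String op;
    final String id;
    final String blockId; // null when the optional blockid field is absent

    TraceEntry(String src, String dest, long bytes, String op, String id, String blockId) {
        this.src = src;
        this.dest = dest;
        this.bytes = bytes;
        this.op = op;
        this.id = id;
        this.blockId = blockId;
    }
}

/** Hypothetical helper for analysis scripts; not part of the patch. */
final class ClientTraceParser {
    // Matches the key/value tail of an entry and skips any prefix before "src:".
    private static final Pattern ENTRY = Pattern.compile(
        "src: ([^,\\s]+), dest: ([^,\\s]+), bytes: (\\d+), op: ([^,\\s]+), id: ([^,\\s]+)"
        + "(?:, blockid: ([^,\\s]+))?\\s*$");

    /** Returns the parsed entry, or empty if the line does not follow the format. */
    static Optional<TraceEntry> parse(String line) {
        Matcher m = ENTRY.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new TraceEntry(
            m.group(1), m.group(2), Long.parseLong(m.group(3)),
            m.group(4), m.group(5), m.group(6)));
    }
}
```

Returning an empty Optional for non-matching lines lets the same pass run over a full DataNode or ReduceTask log, where most lines are not transfer entries.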

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-2.patch

Updated based on Nicholas's feedback, i.e. added {{isInfoEnabled}} guards around appropriate log stmts. Also removed the irrelevant replication log message.
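
For readers unfamiliar with the idiom, the guard looks roughly like the sketch below. It is an illustration only: it uses java.util.logging so that it is self-contained, where isLoggable(Level.INFO) plays the role of isInfoEnabled(). The class, logger name and counter are hypothetical, not code from the patch.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Illustrative only; the names here are not taken from the patch. */
final class GuardedTraceLogging {
    static final Logger TRACE = Logger.getLogger("example.clienttrace");

    // Counts how often the message string was actually built.
    static int formatted = 0;

    static String format(String src, String dest, long bytes, String op, String id) {
        formatted++;
        return "src: " + src + ", dest: " + dest + ", bytes: " + bytes
            + ", op: " + op + ", id: " + id;
    }

    static void logTransfer(String src, String dest, long bytes, String op, String id) {
        // The guard skips the string concatenation when INFO is disabled.
        if (TRACE.isLoggable(Level.INFO)) {
            TRACE.info(format(src, dest, bytes, op, id));
        }
    }
}
```

Without the guard, the argument to info() would be concatenated on every call even when the logger discards it.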

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624451#action_12624451 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3062:
------------------------------------------------

+1 new patch looks good.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620503#action_12620503 ] 

Hadoop QA commented on HADOOP-3062:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12387698/3062-0.patch
  against trunk revision 683448.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3029/console

This message is automatically generated.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Status: Patch Available  (was: Open)

Verified results with a randomwriter/sort run

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624778#action_12624778 ] 

Hudson commented on HADOOP-3062:
--------------------------------

Integrated in Hadoop-trunk #581 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/581/])

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch, 3062-5.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-1.patch

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623482#action_12623482 ] 

Chris Douglas commented on HADOOP-3062:
---------------------------------------

bq.  Should we check whether ClientTraceLog.isInfoEnabled() before logging?

Excluding the string concatenation to produce the actual message, the cost of each log message is low, or the message is infrequent (like the shuffle message). Excluding the new read log message, it's comparable to the logging that's already happening. I'm not certain whether the logging this replaces (for client writes) should occur when ClientTraceLog.isInfoEnabled() is false, since nothing would be logged in that case...

bq. Should we define an AUDIT_FORMAT for the log messages, like FSNamesystem.AUDIT_FORMAT?

Unlike the FSNamesystem audit format, these are going to require some additional processing to be useful (e.g. the id param, optional block id), so the key/value pairing doesn't offer the same syntactical guarantees. That said, you're probably right, but unless we adopt a packaging like what you suggest in your following point, we'd introduce a link between hdfs and mapred. For now, with only these few messages, I don't think it gains much by being pulled out.

bq. I think it might be worth creating a utility class, say org.apache.hadoop.log.AuditLog, so that we could put AUDIT_FORMAT, isInfoEnabled(), etc. inside it. Then, both DataNode and FSNamesystem can use it.

Agreed: it would be better if there were a more central location for Hadoop APIs exported through the logging interfaces, like audit logs and these metrics. If nothing else, it would let us know which messages have consumers (hence the uncertainty for logging client writes). That's likely part of a different patch, though.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Assigned: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas reassigned HADOOP-3062:
-------------------------------------

    Assignee: Chris Douglas

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623480#action_12623480 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3062:
------------------------------------------------

- Should we check whether ClientTraceLog.isInfoEnabled() before logging?

- Should we define an AUDIT_FORMAT for the log messages, like FSNamesystem.AUDIT_FORMAT?

- I think it might be worth creating a utility class, say org.apache.hadoop.log.AuditLog, so that we could put AUDIT_FORMAT, isInfoEnabled(), etc. inside it.  Then, both DataNode and FSNamesystem can use it.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Lohit Vijayarenu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620731#action_12620731 ] 

Lohit Vijayarenu commented on HADOOP-3062:
------------------------------------------

For this
bq. and break them down by racks 
Is there any information logged about this?

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-5.patch

{noformat}
     [exec] -1 overall.

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.

     [exec]     -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
{noformat}

Fixed a findbugs warning. The javadoc warning remains but is unrelated to this patch. The patch passes unit tests.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch, 3062-5.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

      Resolution: Fixed
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])
          Status: Resolved  (was: Patch Available)

I just committed this.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch, 3062-5.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Issue Comment Edited: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623533#action_12623533 ] 

chris.douglas edited comment on HADOOP-3062 at 8/18/08 6:44 PM:
----------------------------------------------------------------

Updated based on Nicholas's feedback, i.e. added {{isInfoEnabled}} guards around appropriate log stmts. Also removed the irrelevant replication log message. I'll commit this if Hudson doesn't object.

      was (Author: chris.douglas):
    Updated based on Nicholas's feedback, i.e. added {{isInfoEnabled}} guards around appropriate log stmts. Also removed the irrelevant replication log message.
  
> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620791#action_12620791 ] 

Chris Douglas commented on HADOOP-3062:
---------------------------------------

bq. Is there any information logged about [breakdown by racks]?

No, that's handled in the analysis. I don't think the datanodes or the reduce tasks know about topology, anyway.
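
As a rough illustration of what that analysis step could look like, the Java sketch below sums bytes per source/destination rack pair. It assumes the analyst supplies a host-to-rack mapping from some external source; the RackTraffic class, the mapping and the "unknown" bucket are hypothetical and nothing like this is in the patch.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical post-processing step; nothing like it is in the patch. */
final class RackTraffic {
    /** A transfer as recovered from one log entry. */
    static final class Transfer {
        final String src;
        final String dest;
        final long bytes;

        Transfer(String src, String dest, long bytes) {
            this.src = src;
            this.dest = dest;
            this.bytes = bytes;
        }
    }

    /**
     * Sums bytes per "srcRack -> destRack" pair. Hosts missing from
     * hostToRack are grouped under "unknown" rather than dropped.
     */
    static Map<String, Long> bytesByRackPair(List<Transfer> transfers,
                                             Map<String, String> hostToRack) {
        Map<String, Long> totals = new TreeMap<>();
        for (Transfer t : transfers) {
            String srcRack = hostToRack.getOrDefault(t.src, "unknown");
            String destRack = hostToRack.getOrDefault(t.dest, "unknown");
            totals.merge(srcRack + " -> " + destRack, t.bytes, Long::sum);
        }
        return totals;
    }
}
```

Keeping the rack lookup outside the logging path matches the point above: the logs record only endpoints, and topology is joined in afterwards.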

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Fix Version/s: 0.19.0
     Hadoop Flags: [Incompatible change]
           Status: Patch Available  (was: Open)

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3062:
------------------------------------

    Release Note: Introduced additional log records for data transfers.
    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed, Incompatible change])

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch, 3062-5.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-4.patch

* Added storageID to datanode string
* Replaced redundant log message

This probably needs only one more pass in review.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch, 3062-4.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Commented: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12618896#action_12618896 ] 

Chris Douglas commented on HADOOP-3062:
---------------------------------------

The analysis should leverage HADOOP-3719, so this issue should cover the log4j appender emitting the HDFS and shuffling data. There are a few open questions and arguable assumptions:

* Should this count bytes successfully transferred separately from failed transfers? Should failed transfers be logged at all?
* The header/metadata/etc. traffic is assumed to be a negligible fraction of the total network traffic and irrelevant to the analysis for a particular job. The overall network utilization is also best measured using standard monitoring utilities that don't require any knowledge of Hadoop. This issue will focus only on tracking block traffic over HDFS (reads, writes, replications) and map output fetched during the shuffle.
* For local reads, the source and destination IP will match. This should be sufficient to detect and discard them during analysis of network traffic, but will not be sufficient to account for all reads from the local disk (counters and job history are likely better tools for this).
* Accounting for topology (to break down by racks, etc.) is best deferred to the analysis. Logging changes in topology would also be helpful, though I don't know whether Hadoop has sufficient information to do this in the general case.
* If job information is available (in the shuffle), should it be included in the entry? Doing this for HDFS is non-trivial, but would be invaluable to the analysis. I'm not yet certain how to do this. Of course, replications and rebalancing won't include this, and HDFS reads prior to job submission (and all other traffic from JobClient) will likely be orphaned as well.
* Should this include start/end entries so one can infer how long the transfer took?
* What about DistributedCache? Can it be ignored as part of the job setup, which is already omitted?
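
As a rough illustration of the local-read and topology points above, the analysis side could drop entries whose source and destination match and then sum bytes per rack pair. This is only a sketch: the function {{bytes_by_rack_pair}}, the {{rack_of}} mapping, and the rack names are hypothetical, and how an IP-to-rack mapping would actually be obtained is left open.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One already-parsed transfer: (source IP, destination IP, bytes).
Record = Tuple[str, str, int]


def bytes_by_rack_pair(
    records: Iterable[Record],
    rack_of: Dict[str, str],  # hypothetical IP -> rack mapping supplied by the analyst
) -> Dict[Tuple[str, str], int]:
    """Sum transferred bytes per (source rack, destination rack) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for src, dst, nbytes in records:
        if src == dst:
            # Matching source and destination means a local read:
            # no network traffic, so it is discarded here.
            continue
        # Hosts missing from the mapping are grouped under "unknown".
        key = (rack_of.get(src, "unknown"), rack_of.get(dst, "unknown"))
        totals[key] += nbytes
    return dict(totals)
```

Keeping the rack lookup outside the log entries means the same log can be re-analyzed if the topology mapping changes.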

In general, the format will follow:
{noformat}
<log4j schema including timestamp, etc.> source: <src IP>, destination: <dst IP>, bytes: <bytes>, operation: <op enum>[, taskid: <TaskID>]
{noformat}

Where {{<(src|dst) IP>}} are the IP addresses of the source and destination nodes, {{<bytes>}} is a long, and {{<op enum>}} is one of {{HDFS_READ}}, {{HDFS_WRITE}}, {{HDFS_COPY}}, and {{MAPRED_SHUFFLE}}. {{HDFS_REPLACE}} should be redundant if {{HDFS_COPY}} is recorded (I think). The rebalancing traffic isn't relevant to job analysis, but if one is including sufficient information to determine the duration of each transfer it may be interesting. The TaskID should be sufficient, but one could argue that including the JobID would be useful as a point to join on.
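
For what it's worth, an entry in the format above should be easy to pick apart during analysis. The following is a minimal sketch, assuming the payload appears exactly as written in the format line and that whatever log4j prefix precedes it can be ignored; the names {{parse_entry}}, {{Transfer}}, and {{ENTRY_RE}} are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Operation names listed in the proposal above.
KNOWN_OPS = {"HDFS_READ", "HDFS_WRITE", "HDFS_COPY", "MAPRED_SHUFFLE"}

# Matches the proposed payload; the log4j prefix before "source:" is skipped
# because the pattern is applied with search() rather than match().
ENTRY_RE = re.compile(
    r"source: (?P<src>[^,\s]+), "
    r"destination: (?P<dst>[^,\s]+), "
    r"bytes: (?P<bytes>\d+), "
    r"operation: (?P<op>\w+)"
    r"(?:, taskid: (?P<taskid>\S+))?"  # the taskid field is optional
)


class Transfer(NamedTuple):
    src: str
    dst: str
    nbytes: int
    op: str
    taskid: Optional[str]


def parse_entry(line: str) -> Optional[Transfer]:
    """Return the parsed transfer, or None if the line is not a valid entry."""
    m = ENTRY_RE.search(line)
    if m is None or m.group("op") not in KNOWN_OPS:
        return None
    return Transfer(
        m.group("src"),
        m.group("dst"),
        int(m.group("bytes")),
        m.group("op"),
        m.group("taskid"),
    )
```

Lines that don't match, or that carry an operation outside the four listed, come back as None, so unrelated log output can share the same file.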

Thoughts?

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Status: Open  (was: Patch Available)

The patch was mauled by HADOOP-3935, and the second and third (HADOOP-3658) bullets should be addressed.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Chris Douglas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-3.patch

* Moved mapred logging from ReduceTask to the TaskTracker
* Changed HDFS_READ logging to record bytes actually read from datanode rather than bytes requested
* Put the *.clienttrace format into TaskTracker and DataNode

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch, 3062-2.patch, 3062-3.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.


[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling and break them down by racks

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3062:
-------------------------------------------

     Description: 
In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
bandwidth is the bottleneck when certain jobs are running on a cluster.


  was:

In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
bandwidth is the bottleneck when certain jobs are running on a cluster.


    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Incompatible change])

Got it. Let's work on a utility class in the future if there is a need.

+1  the patch is good.

> Need to capture the metrics for the network ios generate by dfs reads/writes and map/reduce shuffling  and break them down by racks 
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>            Assignee: Chris Douglas
>             Fix For: 0.19.0
>
>         Attachments: 3062-0.patch, 3062-1.patch
>
>
> In order to better understand the relationship between hadoop performance and the network bandwidth, we need to know 
> what the aggregated traffic data in a cluster and its breakdown by racks. With these data, we can determine whether the network 
> bandwidth is the bottleneck when certain jobs are running on a cluster.
