Posted to common-dev@hadoop.apache.org by "Chris Douglas (JIRA)" <ji...@apache.org> on 2008/08/07 04:45:44 UTC

[jira] Updated: (HADOOP-3062) Need to capture the metrics for the network I/Os generated by DFS reads/writes and map/reduce shuffling and break them down by racks

     [ https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated HADOOP-3062:
----------------------------------

    Attachment: 3062-0.patch

First draft.

Format:
{noformat}
<log4j schema including timestamp, etc.> src: <src IP>, dest: <dst IP>, bytes: <bytes>, op: <op enum>, id: <DFSClient id|taskid>[, blockid: <block id>] 
{noformat}
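
For illustration only (not part of the patch), a minimal sketch of parsing the payload portion of an entry in this format, assuming the log4j prefix has already been stripped; the op and id values used below are made up:
{noformat}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClientTraceParser {
  // blockid is optional, per the schema above
  private static final Pattern ENTRY = Pattern.compile(
      "src: ([^,]+), dest: ([^,]+), bytes: (\\d+), op: ([^,]+), "
      + "id: (\\S+?)(?:, blockid: (\\S+))?\\s*$");

  public static void main(String[] args) {
    // Example payload; op and id values are illustrative placeholders
    String payload = "src: 10.0.0.1, dest: 10.0.0.2, bytes: 67108864, "
        + "op: READ_BLOCK, id: DFSClient_task_200808070000_0001_m_000000";
    Matcher m = ENTRY.matcher(payload);
    if (m.find()) {
      System.out.println("src=" + m.group(1) + " dest=" + m.group(2)
          + " bytes=" + m.group(3) + " op=" + m.group(4)
          + " id=" + m.group(5)
          + (m.group(6) != null ? " blockid=" + m.group(6) : ""));
    }
  }
}
{noformat}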

The patch adds the DFSClient clientName to OP_READ_BLOCK and changes the String in OP_WRITE_BLOCK from the path (which is unused) to the clientName. If this is set to DFSClient_<taskid> in map and reduce tasks, tracing the output of a job should be straightforward after some processing of each entry. Writes for replication (where the clientName is "") are logged as they have been; the logging in PacketResponder has been reformatted to fit the preceding schema. A few known issues:

* The logging assumes the IP address is sufficient to distinguish a source, particularly for writes and in the shuffle
* This logs to the DataNode and ReduceTask appenders; these entries should be directed elsewhere and disabled by default
* In testing this, some of the read entries exhibited a strange property: the source and destination match, but neither matches the DataNode on which the entry is logged. I'm clearly missing something.

I tried tracing a few blocks and map outputs through the logs, and those entries made sense. That said, as mentioned in the last bullet, not all of the entries did.
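
Down the line, the rack-level rollup the issue asks for could be computed from these entries. A rough sketch, assuming a hypothetical ipToRack() mapping (a stand-in for the cluster's topology script) and hard-coded example tuples:
{noformat}
import java.util.HashMap;
import java.util.Map;

public class RackTrafficRollup {
  // Hypothetical stand-in for the cluster's rack-awareness mapping;
  // a real deployment would resolve IPs via its topology script.
  static String ipToRack(String ip) {
    return ip.startsWith("10.0.0.") ? "/rack-a" : "/rack-b";
  }

  public static void main(String[] args) {
    // (src IP, dest IP, bytes) tuples as parsed from trace entries;
    // the values here are made up for illustration.
    String[][] entries = {
        {"10.0.0.1", "10.0.0.2", "67108864"},
        {"10.0.0.1", "10.0.1.5", "134217728"},
    };
    Map<String, Long> byRackPair = new HashMap<String, Long>();
    for (String[] e : entries) {
      String key = ipToRack(e[0]) + " -> " + ipToRack(e[1]);
      long prev = byRackPair.containsKey(key) ? byRackPair.get(key) : 0L;
      byRackPair.put(key, prev + Long.parseLong(e[2]));
    }
    System.out.println(byRackPair);
  }
}
{noformat}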

> Need to capture the metrics for the network I/Os generated by DFS reads/writes and map/reduce shuffling and break them down by racks
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3062
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3062
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: metrics
>            Reporter: Runping Qi
>         Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between Hadoop performance and network bandwidth, we need to know
> the aggregate traffic in a cluster and its breakdown by rack. With these data, we can determine whether network
> bandwidth is the bottleneck when certain jobs are running on a cluster.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.