Posted to common-dev@hadoop.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/03/10 00:01:21 UTC

[jira] Commented: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

    [ http://issues.apache.org/jira/browse/HADOOP-72?page=comments#action_12369754 ] 

Doug Cutting commented on HADOOP-72:
------------------------------------

Did you look at the web UI to see how many map tasks were used to execute this?  My suspicion is that only a single map task was used.  A SequenceFile cannot be split into chunks smaller than 2k, and with fewer than 64 files your single input file is probably under 2k.  You could instead use a text input file, which can be split into smaller chunks, use a custom getSplits() implementation that actually parses the input file (sketched below), or use a much larger number of files.
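
For illustration, a minimal sketch of the getSplits() route, assuming the control file is rewritten as plain text with one file name per line. The class name is made up, and the signatures follow the classic org.apache.hadoop.mapred API (e.g. FileInputFormat.getInputPaths() and the FileSplit constructor), which may differ in the Hadoop version under test, so treat it as a starting point rather than the actual fix:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical InputFormat: one split per line of the text control file.
public class OneLinePerMapInputFormat extends TextInputFormat {

  // Ignore the requested split count and the SequenceFile minimum split size:
  // parse the control file ourselves and emit one FileSplit per line, so every
  // map task handles exactly one of the benchmark files.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (Path file : FileInputFormat.getInputPaths(job)) {
      FileSystem fs = file.getFileSystem(job);
      BufferedReader in =
          new BufferedReader(new InputStreamReader(fs.open(file), "UTF-8"));
      try {
        long pos = 0;                                   // byte offset of the current line
        String line;
        while ((line = in.readLine()) != null) {
          long len = line.getBytes("UTF-8").length + 1; // + trailing '\n' (assumed)
          splits.add(new FileSplit(file, pos, len, new String[0]));
          pos += len;
        }
      } finally {
        in.close();
      }
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }
}

With one split per control-file line, the framework should schedule one map task per file instead of collapsing everything into a single map.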

> hadoop doesn't take advantage of distributed computing in TestDFSIO
> --------------------------------------------------------------------
>
>          Key: HADOOP-72
>          URL: http://issues.apache.org/jira/browse/HADOOP-72
>      Project: Hadoop
>         Type: Test
>   Components: dfs, fs, mapred
>  Environment: 200 node cluster
>     Reporter: Konstantin Shvachko
>  Attachments: TestDFSIO.java, TestDFSIO_results_200_node_cluster.log, TestDFSIO_results_sequential.log
>
> TestDFSIO runs N map tasks, each either writing to or reading from a separate file of the same size, 
> and collects statistical information on its performance. 
> The reducer then calculates the overall statistics across all maps. 
> It outputs the following data:
> - read or write test
> - date and time the test finished   
> - number of files
> - total number of bytes processed
> - overall throughput in MB/sec
> - average IO rate in MB/sec per file
> __Results__
> I ran 7 iterations of the test one after another on a cluster of ~200 nodes. 
> The file size is the same in all cases: 320 MB. 
> The number of files tried is 1, 2, 4, 8, 16, 32, 64.
> The log file with statistics is attached.
> It looks like we don't have any distributed computing here at all.
> The total execution time increases proportionally to the total size of the data, for both writes and reads.
> Another observation is that the IO rate for reads is only marginally higher than the write rate.
> For comparison I attach timings for the same IOs performed on the same cluster, but sequentially in a simple loop.
> This is the summary:
> Files   map/red time   sequential time
>     1             49                34
>     2             86                69
>     4            158               131
>     8            299               266
>    16            569               532
>    32           1131
>    64           2218
> This doesn't look good, unless there is something wrong with my test (attached) or the cluster settings.
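
For readers comparing the two metrics listed in the description above (overall throughput vs. average IO rate per file), here is a minimal sketch of one plausible aggregation. It is an assumption made for clarity, not the code in the attached TestDFSIO.java, and all names are illustrative:

// Hedged sketch: one plausible way to combine per-file measurements into the
// two summary metrics named in the description. Not the attached TestDFSIO.java.
public class IoSummary {

  /** Overall throughput: total bytes moved divided by total IO time across all files. */
  static double overallThroughputMBs(long[] bytesPerFile, double[] secsPerFile) {
    long totalBytes = 0;
    double totalSecs = 0;
    for (int i = 0; i < bytesPerFile.length; i++) {
      totalBytes += bytesPerFile[i];
      totalSecs += secsPerFile[i];
    }
    return (totalBytes / (1024.0 * 1024.0)) / totalSecs;
  }

  /** Average IO rate: the mean of each file's own MB/sec rate. */
  static double averageIoRateMBs(long[] bytesPerFile, double[] secsPerFile) {
    double sum = 0;
    for (int i = 0; i < bytesPerFile.length; i++) {
      sum += (bytesPerFile[i] / (1024.0 * 1024.0)) / secsPerFile[i];
    }
    return sum / bytesPerFile.length;
  }
}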

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira