Posted to common-dev@hadoop.apache.org by "Konstantin Shvachko (JIRA)" <ji...@apache.org> on 2006/03/09 21:40:43 UTC

[jira] Created: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

hadoop doesn't take advantage of distributed computing in TestDFSIO
-------------------------------------------------------------------

         Key: HADOOP-72
         URL: http://issues.apache.org/jira/browse/HADOOP-72
     Project: Hadoop
        Type: Test
  Components: dfs, fs, mapred  
 Environment: 200 node cluster
    Reporter: Konstantin Shvachko


TestDFSIO runs N map tasks, each of which either writes to or reads from a separate file of the same size 
and collects statistics on its performance. 
The reducer then calculates the overall statistics across all maps. 
It outputs the following data:
- read or write test
- date and time the test finished   
- number of files
- total number of bytes processed
- overall throughput in MB/sec
- average IO rate in MB/sec per file
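
The last two figures in the list above can differ, and a minimal Java sketch may help show why. The formulas here are an assumption: overall throughput is taken as total bytes over the summed per-task IO time, and average IO rate as the mean of the per-file rates. They are not copied from the attached TestDFSIO.java, and the class name, method names, and timings are made up for illustration.

```java
/** Sketch only: these formulas are assumptions, not code from TestDFSIO.java. */
final class DfsIoStats {
    static final double MEGA = 1024.0 * 1024.0;

    /** Rate of a single transfer in MB/sec. */
    static double rateMbPerSec(long bytes, long millis) {
        return (bytes / MEGA) / (millis / 1000.0);
    }

    /** Assumed definition: total bytes divided by the sum of per-task IO times. */
    static double overallThroughput(long[] bytes, long[] millis) {
        long totalBytes = 0;
        long totalMillis = 0;
        for (int i = 0; i < bytes.length; i++) {
            totalBytes += bytes[i];
            totalMillis += millis[i];
        }
        return rateMbPerSec(totalBytes, totalMillis);
    }

    /** Assumed definition: mean of each file's own rate. */
    static double averageIoRate(long[] bytes, long[] millis) {
        double sum = 0.0;
        for (int i = 0; i < bytes.length; i++) {
            sum += rateMbPerSec(bytes[i], millis[i]);
        }
        return sum / bytes.length;
    }

    public static void main(String[] args) {
        long size = 320L * 1024 * 1024;      // 320 MB per file, as in the report
        long[] bytes = {size, size};
        long[] millis = {40_000, 20_000};    // made-up per-task IO times
        System.out.println("overall throughput: " + overallThroughput(bytes, millis));
        System.out.println("average IO rate:    " + averageIoRate(bytes, millis));
    }
}
```

With these made-up timings the slower file pulls the overall throughput below the average IO rate, which is why it can be useful to report both.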

__Results__
I ran 7 iterations of the test, one after another, on a cluster of ~200 nodes. 
The file size is the same in all cases: 320 MB. 
The number of files tried was 1, 2, 4, 8, 16, 32, and 64.
The log file with statistics is attached.
It looks like we don't get any distributed computing here at all:
the total execution time increases in proportion to the total size of the data, for both writes and reads.
Another observation is that the IO rate for reads is only slightly higher than the rate for writes.
For comparison, I attach time measurements for the same IO operations performed on the same cluster, but sequentially in a simple loop.
This is the summary:

Files   map/red time   sequential time
    1             49                34
    2             86                69
    4            158               131
    8            299               266
   16            569               532
   32           1131
   64           2218

This doesn't look good, unless there is something wrong with my test (attached) or the cluster settings.
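
One way to read the summary table is to divide each map/red time by the number of files. The Java sketch below does only that arithmetic, using the numbers from the table; the class and method names are hypothetical. If the maps really ran in parallel, the per-file figure would be expected to shrink roughly as 1/N.

```java
/** Sketch: per-file cost derived from the map/red column of the summary table. */
final class PerFileCost {
    /** Total time divided by the number of files, in the table's own time units. */
    static double perFile(int files, int totalTime) {
        return (double) totalTime / files;
    }

    public static void main(String[] args) {
        int[] files = {1, 2, 4, 8, 16, 32, 64};
        int[] mapRedTime = {49, 86, 158, 299, 569, 1131, 2218};
        for (int i = 0; i < files.length; i++) {
            System.out.println(files[i] + " files: " + perFile(files[i], mapRedTime[i]) + " per file");
        }
    }
}
```

Instead of shrinking, the per-file time levels off at around 35 once 16 or more files are used, which matches the observation that the total time grows in proportion to the total data size.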


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/HADOOP-72?page=comments#action_12369754 ] 

Doug Cutting commented on HADOOP-72:
------------------------------------

Did you look at the web UI to see how many map tasks were used to execute this?  My suspicion is that only a single map task was used.  A SequenceFile cannot be split into chunks smaller than 2k.  With fewer than 64 files, your single input file is probably smaller than 2k.  You could instead use a text input file, which can be split into smaller chunks; or use a custom getSplits() implementation that actually parses the input file; or use a much larger number of files.
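
A rough model of the limit described in this comment, written as a Java sketch. It is not Hadoop's actual getSplits() logic: it only assumes that an input file can yield at most one split per minimum-size chunk, and it takes "2k" to mean 2048 bytes. The class name, method name, and example sizes are hypothetical.

```java
/** Sketch: upper bound on splits when each split must be at least a minimum size. */
final class SplitEstimate {
    /** Assumed value for the "2k" minimum mentioned in the comment. */
    static final long MIN_SPLIT_BYTES = 2 * 1024;

    /** An empty input gives no splits; any smaller-than-minimum input gives one. */
    static long maxSplits(long inputBytes, long minSplitBytes) {
        if (inputBytes <= 0) {
            return 0;
        }
        return Math.max(1, inputBytes / minSplitBytes);
    }

    public static void main(String[] args) {
        // A small control file (size made up) cannot be divided at all.
        System.out.println(maxSplits(1500, MIN_SPLIT_BYTES));
        // A larger input (size made up) allows several splits, hence several map tasks.
        System.out.println(maxSplits(8192, MIN_SPLIT_BYTES));
    }
}
```

Under this model, a single input file below the minimum size always ends up in one split, so only one map task would do all the IO regardless of how many test files it lists.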



[jira] Updated: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-72?page=all ]

Konstantin Shvachko updated HADOOP-72:
--------------------------------------

    Attachment: TestDFSIO.java
                TestDFSIO_results.log

TestDFSIO is now modified to create one input file for each map task.
That way the maps run in parallel.
Everything gets very slow when the number of writes is close to or 
larger than the size of the cluster.
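
The idea of one input file per map task can be sketched in Java as follows. This is an illustration only, not the code in the attached TestDFSIO.java: it writes small text files to the local file system with java.nio rather than using Hadoop's file system or SequenceFile classes, and the class name, file names, and line format are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch: one small control file per test file, so each can become its own input. */
final class ControlFiles {
    /** Writes nFiles control files into dir; each names one test file and its size. */
    static List<Path> create(Path dir, int nFiles, long fileBytes) {
        try {
            Files.createDirectories(dir);
            List<Path> created = new ArrayList<>();
            for (int i = 0; i < nFiles; i++) {
                Path control = dir.resolve("control_" + i + ".txt");
                // Hypothetical record format: test file name, tab, size in bytes.
                String line = "test_io_" + i + "\t" + fileBytes + "\n";
                Files.write(control, line.getBytes(StandardCharsets.UTF_8));
                created.add(control);
            }
            return created;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "dfsio_control_demo");
        List<Path> files = create(dir, 4, 320L * 1024 * 1024);
        System.out.println("created " + files.size() + " control files in " + dir);
    }
}
```

Because every control file is a separate input, the framework has at least one split per test file and no longer depends on being able to divide a single tiny file.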




[jira] Updated: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-72?page=all ]

Konstantin Shvachko updated HADOOP-72:
--------------------------------------

    Attachment:     (was: TestDFSIO.java)



[jira] Resolved: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-72?page=all ]
     
Doug Cutting resolved HADOOP-72:
--------------------------------

    Fix Version: 0.2
     Resolution: Won't Fix

This was caused by a misunderstanding.



[jira] Updated: (HADOOP-72) hadoop doesn't take advantage of distributed computing in TestDFSIO

Posted by "Konstantin Shvachko (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/HADOOP-72?page=all ]

Konstantin Shvachko updated HADOOP-72:
--------------------------------------

    Attachment: TestDFSIO.java
                TestDFSIO_results_200_node_cluster.log
                TestDFSIO_results_sequential.log
