Posted to common-user@hadoop.apache.org by Raj V <ra...@yahoo.com> on 2011/01/11 06:06:41 UTC

TeraSort question.

All,
 
I have been running terasort on a 480-node Hadoop cluster. I also collected CPU, memory, disk, and network statistics during the run. The system stats are quite interesting; I can post them once I have put them together in a presentable format (if there is interest). However, while looking at the data, I noticed something unexpected.

I thought, intuitively, that all the systems in the cluster would show more or less similar behaviour (possibly shifted in time), so that the overall graph would look the same on each node.

Just to confirm this, I took 5 random nodes and looked at the CPU, disk, network, etc. activity while the sort was running. Strangely enough, it was not so. Two of the 5 systems were seriously busy: heavy I/O with lots of disk and network activity. On the other three systems, the CPU was more or less 100% idle, with only slight network and I/O activity.

Is that normal and/or expected? Shouldn't all the nodes be utilized in more or less the same manner over the length of the run?

I generated the data for the sort using teragen (128MB block size, replication = 3).

I would also be interested in other people's sort timings. Is there some place where people can post sort numbers (not just the record)?

I will post the actual graphs of the 5 nodes tomorrow, if there is interest. (There are some logistical issues with posting them tonight.)

I am using CDH3B3, though I think this is not specific to CDH3B3.
 
Sorry for the cross post.
 
Raj

Re: TeraSort question.

Posted by Ted Dunning <td...@maprtech.com>.
Raj,

Do you have the job history files?  That would be very useful.  I would be
happy to create some swimlane and related graphs for you if you can send me
the history files.

Re: TeraSort question.

Posted by Adarsh Sharma <ad...@orkash.com>.
If possible, please also post your configuration parameters, such as
*dfs.data.dir*, *mapred.local.dir*, the map and reduce parameters, Java, etc.


Thanks


Re: TeraSort question.

Posted by bharath vissapragada <bh...@gmail.com>.
Raj,

Please post the figures and graphs. Figures for large clusters (> 200
nodes) are certainly interesting.

Thanks

Re: TeraSort question.

Posted by Phil Whelan <ph...@gmail.com>.
Hi Raj,

> Two of the 5 systems were seriously busy, big IO with lots of disk and network activity. The other three systems, CPU was more or less 100% idle, slight network and I/O.

By default this process runs just 2 map tasks, so only 2 nodes are
utilized. Did you try setting the mapred.map.tasks option? I found a
very similar question and answer here:

http://www.mail-archive.com/common-user@hadoop.apache.org/msg00005.html

>> 1.      The data is generated in a fashion to where it is not balanced
>> across my cluster.  This is because the data is generated with 2 maps.
>
> These are due to the default #maps/#reduces in Map-Reduce.
> Use:
> $ bin/hadoop jar hadoop-*-dev-examples.jar teragen -Dmapred.map.tasks=8000 10000000000 /tera/in
> $ bin/hadoop jar hadoop-*-dev-examples.jar terasort -Dmapred.reduce.tasks=5300 /tera/in /tera/out
> Arun

Hope that helps.

Thanks,
Phil
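
As a rough illustration of the numbers in the commands quoted above, the sketch below works out the input size and approximate task counts. It assumes 100-byte TeraGen rows and one TeraSort map task per 128 MB block. The variable and function names are made up for this example, and the row count is simply the one from the quoted teragen command, not a figure reported for the 480-node run.

```shell
#!/bin/sh
# Back-of-the-envelope arithmetic for a TeraGen/TeraSort run.
# Assumptions (not confirmed in the thread): 100-byte rows, and roughly
# one TeraSort map task per 128 MB HDFS block of input.

rows=10000000000                      # row count from the quoted teragen command
row_bytes=100                         # assumed TeraGen row size
block_bytes=$((128 * 1024 * 1024))    # 128 MB block size

total_bytes=$((rows * row_bytes))

# Ceiling division: number of input blocks, and so roughly the number
# of map tasks TeraSort would launch over this input.
sort_maps=$(( (total_bytes + block_bytes - 1) / block_bytes ))

# Bytes written by each TeraGen map task for a given number of map tasks.
bytes_per_gen_map() {
    echo $(( total_bytes / $1 ))
}

echo "input size:                $total_bytes bytes"
echo "approx. terasort maps:     $sort_maps"
echo "teragen output, 2 maps:    $(bytes_per_gen_map 2) bytes per map"
echo "teragen output, 8000 maps: $(bytes_per_gen_map 8000) bytes per map"
```

With only 2 generating map tasks, each task writes half of the data set. HDFS normally places the first replica of each block on the node that writes it, which would be consistent with a couple of nodes holding far more of the input than the rest.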
