Posted to common-user@hadoop.apache.org by Usman Waheed <us...@opera.com> on 2009/10/14 10:48:00 UTC

Hardware performance from HADOOP cluster

Hi,

Is there a way to tell what kind of performance numbers one can expect 
out of a cluster, given a certain set of specs?

For example, I have 5 nodes in my cluster that all have the following 
hardware configuration:
Quad Core 2.0GHz, 8GB RAM, 4x1TB disks, all on the same rack.

Thanks,
Usman


Re: Hardware performance from HADOOP cluster

Posted by Chris Seline <ch...@searchles.com>.
Unless my calcs are off, that is right in line with the terabyte sort 
record:

1TB = roughly 1,000,000 MB
1460 nodes
62 seconds
1,000,000 / 1460 / 62 = ~11 MB/s per machine

but they used 2.5GHz quad-core dual-CPU servers, so they have roughly 
2.5x the horsepower of Usman's setup.
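This back-of-the-envelope comparison can be reproduced with a short script (the 40 GB / 2136 s figures come from Usman's benchmark results quoted later in the thread):

```python
# Per-node sort throughput in MB/s: data volume / node count / wall-clock time.
def mb_per_sec_per_node(total_mb, nodes, seconds):
    return total_mb / nodes / seconds

# Yahoo terabyte-sort record: ~1,000,000 MB over 1460 nodes in 62 s
record = mb_per_sec_per_node(1_000_000, 1460, 62)   # ~11 MB/s per node

# Usman's benchmark: 40 GB sorted on 4 nodes in 2136 s
usman = mb_per_sec_per_node(40_000, 4, 2136)        # ~4.7 MB/s per node

print(f"record: {record:.1f} MB/s/node, Usman: {usman:.1f} MB/s/node")
```

The second figure lands inside the 4-5 MB/sec/node range Todd estimates for Usman's cluster.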

http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html

Or am I wrong?

Chris


Todd Lipcon wrote:
> This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
> you changed the configurations at all? There are some notes on this
> blog post that might help your performance a bit:
>
> http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
>
> How many map and reduce slots did you configure for the daemons? If
> you have Ganglia installed you can usually get a good idea of whether
> you're using your resources well by looking at the graphs while
> running a job like this sort.
>
> -Todd
>
> On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <us...@opera.com> wrote:
>   
>> Here are the results i got from my 4 node cluster (correction i noted 5
>> earlier). One of my nodes out of the 4 is a namenode+datanode both.
>>
>> GENERATE RANDOM DATA
>> Wrote out 40GB of random binary data:
>> Map output records=4088301
>> The job took 358 seconds. (approximately: 6 minutes).
>>
>> SORT RANDOM GENERATED DATA
>> Map output records=4088301
>> Reduce input records=4088301
>> The job took 2136 seconds. (approximately: 35 minutes).
>>
>> VALIDATION OF SORTED DATA
>> The job took 183 seconds.
>> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>>
>> It would be interesting to see what performance numbers others with a
>> similar setup have obtained.
>>
>> Thanks,
>> Usman
>>
>>     
>>> I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
>>> cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
>>> issues though so it won't start up...
>>>
>>> Tim
>>>
>>>
>>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
>>>
>>>       
>>>> Thanks Tim, i will check it out and post my results for comments.
>>>> -Usman
>>>>
>>>>         
>>>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>>>>> posting your results for comment?
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> Hi,
>>>>>>
>>>>>> Is there a way to tell what kind of performance numbers one can expect
>>>>>> out
>>>>>> of their cluster given a certain set of specs.
>>>>>>
>>>>>> For example i have 5 nodes in my cluster that all have the following
>>>>>> hardware configuration(s):
>>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>>>>
>>>>>> Thanks,
>>>>>> Usman


Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Yeah, they are single-proc machines, and other than setting 4
maps/reduces it is a completely vanilla 0.20.1 installation.

I will tune it up in the morning based on what I can find on the web
(e.g. the Cloudera guidelines) and post the results.  I am going to be
running HBase on top of this, but want to make sure HDFS/MR is
running soundly before continuing.

Seems there are a few people at the moment setting up clusters - might
it be worth adding our config and results to
http://wiki.apache.org/hadoop/HardwareBenchmarks ?

For people like me (first cluster set up from scratch - previously
used the EC2 scripts) it is nice to sanity-check that things look about
right.  The mailing lists suggest there are a few small clusters of
medium-spec machines springing up.

Cheers,
Tim




On Thu, Oct 15, 2009 at 5:52 PM, Patrick Angeles
<pa...@gmail.com> wrote:
> Hi Tim,
> I assume those are single proc machines?
>
> I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
> have dual quad Nehalems (2.26Ghz).
>
> On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
> <ti...@gmail.com>wrote:
>
>> Hi Usman,
>>
>> So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
>> high memory jobs so picked 4 only)
>> [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA
>> drives]
>>
>> Using your template for stats, I get the following with no tuning:
>>
>> GENERATE RANDOM DATA
>> Wrote out 90GB of random binary data:
>> Map output records=9198009
>> The job took 350 seconds. (approximately: 6 minutes)
>>
>> SORT RANDOM GENERATED DATA
>> Map output records= 9197821
>> Reduce input records=9197821
>> The job took 2176 seconds. (approximately: 36mins).
>>
>> So pretty similar to your initial benchmark.  I will tune it a bit
>> tomorrow and rerun.
>>
>> If you spent time tuning your cluster and it was successful, please
>> can you share your config?
>>
>> Cheers,
>> Tim
>>
>>
>>
>>
>>
>> On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed <us...@opera.com> wrote:
>> > Hi Todd,
>> >
>> > Some changes have been applied to the cluster based on the documentation
>> > (URL) you noted below,
>> > like file descriptor settings and io.file.buffer.size. I will check out
>> the
>> > other settings which I haven't applied yet.
>> >
>> > My map/reduce slot settings from my hadoop-site.xml and
>> hadoop-default.xml
>> > on all nodes in the cluster.
>> >
>> > _*hadoop-site.xml
>> > *_mapred.tasktracker.task.maximum = 2
>> > mapred.tasktracker.map.tasks.maximum = 8
>> > mapred.tasktracker.reduce.tasks.maximum = 8
>> > _*
>> > hadoop-default.xml
>> > *_mapred.map.tasks = 2
>> > mapred.reduce.tasks = 1
>> >
>> > Thanks,
>> > Usman
>> >
>> >

Re: Hardware performance from HADOOP cluster

Posted by Patrick Angeles <pa...@gmail.com>.
Hi Tim,
I assume those are single proc machines?

I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
have dual quad-core Nehalems (2.26GHz).

On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
<ti...@gmail.com>wrote:

> Hi Usman,
>
> So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
> high memory jobs so picked 4 only)
> [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA
> drives]
>
> Using your template for stats, I get the following with no tuning:
>
> GENERATE RANDOM DATA
> Wrote out 90GB of random binary data:
> Map output records=9198009
> The job took 350 seconds. (approximately: 6 minutes)
>
> SORT RANDOM GENERATED DATA
> Map output records= 9197821
> Reduce input records=9197821
> The job took 2176 seconds. (approximately: 36mins).
>
> So pretty similar to your initial benchmark.  I will tune it a bit
> tomorrow and rerun.
>
> If you spent time tuning your cluster and it was successful, please
> can you share your config?
>
> Cheers,
> Tim
>
>
>
>
>

Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Hi Tim,

Thanks much for sharing the info.
I will most certainly share my configuration settings after applying 
some tuning at my end.
Will let you know the results on this email list.

Thanks,
Usman


> Hi Usman,
>
> So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
> high memory jobs so picked 4 only)
> [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives]
>
> Using your template for stats, I get the following with no tuning:
>
> GENERATE RANDOM DATA
> Wrote out 90GB of random binary data:
> Map output records=9198009
> The job took 350 seconds. (approximately: 6 minutes)
>
> SORT RANDOM GENERATED DATA
> Map output records= 9197821
> Reduce input records=9197821
> The job took 2176 seconds. (approximately: 36mins).
>
> So pretty similar to your initial benchmark.  I will tune it a bit
> tomorrow and rerun.
>
> If you spent time tuning your cluster and it was successful, please
> can you share your config?
>
> Cheers,
> Tim
>
>
>
>
>


Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Hi Usman,

So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
high memory jobs so picked 4 only)
[9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives]

Using your template for stats, I get the following with no tuning:

GENERATE RANDOM DATA
Wrote out 90GB of random binary data:
Map output records=9198009
The job took 350 seconds. (approximately: 6 minutes)

SORT RANDOM GENERATED DATA
Map output records=9197821
Reduce input records=9197821
The job took 2176 seconds. (approximately: 36mins).

So pretty similar to your initial benchmark.  I will tune it a bit
tomorrow and rerun.

If you spent time tuning your cluster and it was successful, please
can you share your config?

Cheers,
Tim






Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Just while I am on this thread of performance I thought I would continue...

I am moving a 200-million-record table (mostly varchar) from a MyISAM
MySQL database to Hadoop.  First MapReduce tests seem pretty
positive.  A full table scan in MySQL takes 17 minutes on a $20,000,
pretty heavyweight 64G-memory server (sorry, I don't know the exact
CPU/disk specs offhand), while a count that can use an indexed INTEGER
column (index pegged in memory) takes 3 minutes.  After exporting the
table and importing it into HDFS as a tab-delimited file (100GB) on
the above cluster, the same scans take 4 minutes as a custom MR job.

Actually I have two 200-million-record tables, and a full scan
requiring a join of both those tables in MySQL is around 30 minutes,
but I joined them in the export before the import into Hadoop.  So
some of my ad hoc reports have just come down from 30 mins to 4 mins
(and I will use Hive for this).
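As a sanity check, the reported timings work out as follows (a rough sketch; assigning the 100 GB file to the 9-datanode cluster described earlier in the thread is my assumption, not stated in the message):

```python
# Back-of-the-envelope check of the MySQL vs. Hadoop scan times above.
# Assumes the 100 GB file is spread across 9 datanodes (an assumption).

mysql_scan_s = 17 * 60       # full table scan in MySQL
hadoop_scan_s = 4 * 60       # same scan as a custom MR job over HDFS
join_scan_s = 30 * 60        # MySQL scan requiring the two-table join

speedup = mysql_scan_s / hadoop_scan_s        # plain full scan
join_speedup = join_scan_s / hadoop_scan_s    # join pre-materialized on export
per_node_mb_s = 100_000 / 9 / hadoop_scan_s   # MB/s per datanode

print(f"{speedup:.2f}x  {join_speedup:.1f}x  {per_node_mb_s:.0f} MB/s/node")
```

So the MR job reads roughly an order of magnitude faster per node than the sort benchmarks in this thread, which is expected for a scan with no shuffle/sort phase.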

Tim


> On Mon, Oct 19, 2009 at 4:14 PM, Usman Waheed <us...@opera.com> wrote:
> Tim,
>
> I have 4 nodes (Quad Core 2.00Ghz, 8GB RAM, 4x1TB disks), where one is the
> master+datanode and the rest are datanodes.
>
> Job: Sort 40GB of random data
>
> With the following current configuration setting:
>
> io.sort.factor: 10
> io.sort.mb: 100
> io.file.buffer.size: 65536
> mapred.child.java.opts: -Xmx200M
> dfs.datanode.handler.count=3
> 2 Mappers
> 2 Reducer
> Time taken: 28 minutes
>
> Still testing with more config changes, will send the results out.
>
> -Usman
>
>
>
>> Hi all,
>>
>> I thought I would post the findings of my tuning tests running the
>> sort benchmark.
>>
>> This is all based on 10 machines (1 as masters and 9 DN/TT) each of:
>> Dell R300: 2.83G Quadcore (2x6MB cache 1 proc), 8G RAM and 2x500G SATA
>> drives
>>
>> --- Vanilla installation ---
>> 2M 2R: 36 mins
>> 4M 4R: 36 mins (yes the same)
>>
>>
>> --- Tuned according to Cloudera http://tinyurl.com/ykupczu ---
>> io.sort.factor: 20  (mapred-site.xml)
>> io.sort.mb: 200  (mapred-site.xml)
>> io.file.buffer.size: 65536   (core-site.xml)
>> mapred.child.java.opts: -Xmx512M  (mapred-site.xml)
>>
>> 2M 2R: 33.5 mins
>> 4M 4R: 29 mins
>> 8M 8R: 41 mins
>>
>>
>> --- Increasing the task memory a little ---
>> io.sort.factor: 20
>> io.sort.mb: 200
>> io.file.buffer.size: 65536
>> mapred.child.java.opts: -Xmx1G
>>
>> 2M 2R: 29 mins  (adding dfs.datanode.handler.count=8 resulted in 30 mins)
>> 4M 4R: 29 mins (yes the same)
>>
>>
>> --- Increasing sort memory ---
>> io.sort.factor: 32
>> io.sort.mb: 320
>> io.file.buffer.size: 65536
>> mapred.child.java.opts: -Xmx1G
>>
>> 2M 2R: 31 mins (yes longer than lower sort sizes)
>>
>> I am going to stick with the following for now and get back to work...
>>  io.sort.factor: 20
>>  io.sort.mb: 200
>>  io.file.buffer.size: 65536
>>  mapred.child.java.opts: -Xmx1G
>>  dfs.datanode.handler.count=8
>>  4 Mappers
>>  4 Reducer
>>
>> Hope that helps someone.  How did your tuning go Usman?
>>
>> Tim
>>
>>
>> On Fri, Oct 16, 2009 at 10:41 PM, tim robertson
>> <ti...@gmail.com> wrote:
>>
>>>
>>> No worries Usman,  I will try and do the same on Monday.
>>>
>>> Thanks Todd for the clarification.
>>>
>>> Tim
>>>
>>>
>>> On Fri, Oct 16, 2009 at 5:30 PM, Usman Waheed <us...@opera.com> wrote:
>>>
>>>>
>>>> Hi Tim,
>>>>
>>>> I have been swamped with some other stuff so did not get a chance to run
>>>> further tests on my setup.
>>>> Will send them out early next week so we can compare.
>>>>
>>>> Cheers,
>>>> Usman
>>>>
>>>>
>>>>>
>>>>> On Fri, Oct 16, 2009 at 4:01 AM, tim robertson
>>>>> <ti...@gmail.com>wrote:
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Adding the following to core-site.xml, mapred-site.xml and
>>>>>> hdfs-site.xml (based on Cloudera guidelines:
>>>>>> http://tinyurl.com/ykupczu)
>>>>>>  io.sort.factor: 15  (mapred-site.xml)
>>>>>>  io.sort.mb: 150  (mapred-site.xml)
>>>>>>  io.file.buffer.size: 65536   (core-site.xml)
>>>>>>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
>>>>>> default)
>>>>>>
>>>>>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>>>>>>
>>>>>> Using 2 mappers and 2 reducers, can someone please help me with the
>>>>>> maths as to why my jobs are failing with "Error: Java heap space" in
>>>>>> the maps?
>>>>>> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>>>>>>
>>>>>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>>>>>> Plus the 2 daemons on the node at 1G each = 1.8G
>>>>>> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>>>>>>
>>>>>> The machines have 8G in them.  Obviously my maths is screwy
>>>>>> somewhere...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Hi Tim,
>>>>>
>>>>> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE
>>>>> parameter
>>>>> is
>>>>> for the daemons, not the tasks. If you bump up io.sort.mb you also have
>>>>> to
>>>>> bump up the -Xmx argument in mapred.child.java.opts to give the actual
>>>>> tasks
>>>>> more RAM.
>>>>>
>>>>> -Todd
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>>>>>> Usman Waheed <us...@opera.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Hi Todd,
>>>>>>>>
>>>>>>>> Some changes have been applied to the cluster based on the
>>>>>>>> documentation (URL) you noted below,
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I would also like to know what settings people are tuning on the
>>>>>>> operating system level. The blog post mentioned here does not mention
>>>>>>> much about that, except for the fileno changes.
>>>>>>>
>>>>>>> We got about 3x the read performance when running DFSIOTest by
>>>>>>> mounting
>>>>>>> our ext3 filesystems with the noatime parameter. I saw that mentioned
>>>>>>> in the slides from some Cloudera presentation.
>>>>>>>
>>>>>>> (For those who don't know, the noatime parameter turns off the
>>>>>>> recording of access time on files. That's a horrible performance
>>>>>>> killer
>>>>>>> since it means every read of a file also means that the kernel must
>>>>>>> do
>>>>>>> a write. These writes are probably queued up, but still, if you don't
>>>>>>> need the atime (very few applications do), turn it off!)
>>>>>>>
>>>>>>> Have people been experimenting with different filesystems, or are
>>>>>>> most
>>>>>>> of us running on top of ext3?
>>>>>>>
>>>>>>> How about mounting ext3 with "data=writeback"? That's rumoured to
>>>>>>> give
>>>>>>> the best throughput and could help with write performance. From
>>>>>>> mount(8):
>>>>>>>
>>>>>>>   writeback
>>>>>>>          Data ordering is not preserved - data may be written into
>>>>>>> the
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> main file system
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>          after its metadata has been  committed  to the journal.
>>>>>>>  This
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> is rumoured to be the
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>          highest throughput option.  It guarantees internal file
>>>>>>> system
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> integrity,
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>          however it can allow old data to appear in files after a
>>>>>>> crash
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> and journal recovery.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> How would the HDFS consistency checks cope with old data appearing in
>>>>>>> the unerlying files after a system crash?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> \EF
>>>>>>> --
>>>>>>> Erik Forsberg <fo...@opera.com>
>>>>>>> Developer, Opera Software - http://www.opera.com/

Re: Hardware performance from HADOOP cluster

Posted by Bryan Talbot <bt...@aeriagames.com>.
Here's another data point from a small cluster running Cloudera's 0.20.1 distribution:

4 slaves, each with 2 quad-core Xeon E5405 2.00 GHz CPUs, 8 GB RAM and 4x1TB SATA drives
1 master running the namenode, secondary namenode and jobtracker (nn, 2nn and jt)


dfs.replication=2
io.sort.factor: 25
io.sort.mb: 250
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx400M
mapred.tasktracker.map.tasks.maximum=7
mapred.tasktracker.reduce.tasks.maximum=7
mapred.job.reuse.jvm.num.tasks=10

$> hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar randomwriter -D dfs.block.size=134217728 input

Takes about 4 mins


$> hadoop jar /usr/lib/hadoop/hadoop-0.20.1+133-examples.jar sort input output

Takes about 11 mins (map takes about 4.5 mins)



With the default configurations, the map tasks run for just a couple
of seconds, and the average number of tasks running at any one time is
just 20% of the map task capacity.  Increasing the block size and
reusing JVM tasks had the most noticeable impact on performance.


-Bryan




On Oct 19, 2009, at 7:14 AM, Usman Waheed wrote:

> io.sort.factor: 10
> io.sort.mb: 100
> io.file.buffer.size: 65536
> mapred.child.java.opts: -Xmx200M
> dfs.datanode.handler.count=3
> 2 Mappers
> 2 Reducer


Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Tim,

I have 4 nodes (Quad Core 2.00Ghz, 8GB RAM, 4x1TB disks), where one is 
the master+datanode and the rest are datanodes.

Job: Sort 40GB of random data

With the following current configuration setting:

io.sort.factor: 10
io.sort.mb: 100
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx200M
dfs.datanode.handler.count=3
2 Mappers
2 Reducers
Time taken: 28 minutes

Still testing with more config changes, will send the results out.
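For anyone reproducing this, the properties listed above map onto Hadoop's XML config files. A sketch of how they might be expressed on a 0.20-era install (file placement follows Tim's annotations elsewhere in the thread; treat it as illustrative, not authoritative):

```xml
<!-- mapred-site.xml (illustrative; values from the settings above) -->
<property>
  <name>io.sort.factor</name>
  <value>10</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>100</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200M</value>
</property>

<!-- core-site.xml -->
<property>
  <name>io.file.buffer.size</name>
  <value>65536</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>3</value>
</property>
```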

-Usman

 

>>>>>  io.sort.factor: 15  (mapred-site.xml)
>>>>>  io.sort.mb: 150  (mapred-site.xml)
>>>>>  io.file.buffer.size: 65536   (core-site.xml)
>>>>>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
>>>>> default)
>>>>>
>>>>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>>>>>
>>>>> Using 2 mappers and 2 reducers, can someone please help me with the
>>>>> maths as to why my jobs are failing with "Error: Java heap space" in
>>>>> the maps?
>>>>> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>>>>>
>>>>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>>>>> Plus the 2 daemons on the node at 1G each = 1.8G
>>>>> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>>>>>
>>>>> The machines have 8G in them.  Obviously my maths is screwy somewhere...
>>>>>
>>>>>
>>>>>
>>>>>           
>>>> Hi Tim,
>>>>
>>>> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE parameter
>>>> is
>>>> for the daemons, not the tasks. If you bump up io.sort.mb you also have to
>>>> bump up the -Xmx argument in mapred.child.java.opts to give the actual
>>>> tasks
>>>> more RAM.
>>>>
>>>> -Todd
>>>>
>>>>
>>>>
>>>>         
>>>>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com>
>>>>> wrote:
>>>>>
>>>>>           
>>>>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>>>>> Usman Waheed <us...@opera.com> wrote:
>>>>>>
>>>>>>
>>>>>>             
>>>>>>> Hi Todd,
>>>>>>>
>>>>>>> Some changes have been applied to the cluster based on the
>>>>>>> documentation (URL) you noted below,
>>>>>>>
>>>>>>>               
>>>>>> I would also like to know what settings people are tuning on the
>>>>>> operating system level. The blog post mentioned here does not mention
>>>>>> much about that, except for the fileno changes.
>>>>>>
>>>>>> We got about 3x the read performance when running DFSIOTest by mounting
>>>>>> our ext3 filesystems with the noatime parameter. I saw that mentioned
>>>>>> in the slides from some Cloudera presentation.
>>>>>>
>>>>>> (For those who don't know, the noatime parameter turns off the
>>>>>> recording of access time on files. That's a horrible performance killer
>>>>>> since it means every read of a file also means that the kernel must do
>>>>>> a write. These writes are probably queued up, but still, if you don't
>>>>>> need the atime (very few applications do), turn it off!)
>>>>>>
>>>>>> Have people been experimenting with different filesystems, or are most
>>>>>> of us running on top of ext3?
>>>>>>
>>>>>> How about mounting ext3 with "data=writeback"? That's rumoured to give
>>>>>> the best throughput and could help with write performance. From
>>>>>> mount(8):
>>>>>>
>>>>>>    writeback
>>>>>>           Data ordering is not preserved - data may be written into the
>>>>>>
>>>>>>             
>>>>> main file system
>>>>>
>>>>>           
>>>>>>           after its metadata has been  committed  to the journal.  This
>>>>>>
>>>>>>             
>>>>> is rumoured to be the
>>>>>
>>>>>           
>>>>>>           highest throughput option.  It guarantees internal file system
>>>>>>
>>>>>>             
>>>>> integrity,
>>>>>
>>>>>           
>>>>>>           however it can allow old data to appear in files after a crash
>>>>>>
>>>>>>             
>>>>> and journal recovery.
>>>>>
>>>>>           
>>>>>> How would the HDFS consistency checks cope with old data appearing in
>>>>>> the underlying files after a system crash?
>>>>>>
>>>>>> Cheers,
>>>>>> \EF
>>>>>> --
>>>>>> Erik Forsberg <fo...@opera.com>
>>>>>> Developer, Opera Software - http://www.opera.com/
>>>>>>
>>>>>>
>>>>>>             
>>>>         
>>>       
>
>   


Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Hi all,

I thought I would post the findings of my tuning tests running the
sort benchmark.

This is all based on 10 machines (1 running the masters and 9 as DN/TT), each:
Dell R300: 2.83G Quadcore (2x6MB cache 1 proc), 8G RAM and 2x500G SATA drives

--- Vanilla installation ---
2M 2R: 36 mins
4M 4R: 36 mins (yes the same)


--- Tuned according to Cloudera http://tinyurl.com/ykupczu ---
io.sort.factor: 20  (mapred-site.xml)
io.sort.mb: 200  (mapred-site.xml)
io.file.buffer.size: 65536   (core-site.xml)
mapred.child.java.opts: -Xmx512M  (mapred-site.xml)

2M 2R: 33.5 mins
4M 4R: 29 mins
8M 8R: 41 mins


--- Increasing the task memory a little ---
io.sort.factor: 20
io.sort.mb: 200
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx1G

2M 2R: 29 mins  (adding dfs.datanode.handler.count=8 resulted in 30 mins)
4M 4R: 29 mins (yes the same)


--- Increasing sort memory ---
io.sort.factor: 32
io.sort.mb: 320
io.file.buffer.size: 65536
mapred.child.java.opts: -Xmx1G

2M 2R: 31 mins (yes longer than lower sort sizes)

I am going to stick with the following for now and get back to work...
  io.sort.factor: 20
  io.sort.mb: 200
  io.file.buffer.size: 65536
  mapred.child.java.opts: -Xmx1G
  dfs.datanode.handler.count=8
  4 Mappers
  4 Reducers

Hope that helps someone.  How did your tuning go Usman?

Tim


On Fri, Oct 16, 2009 at 10:41 PM, tim robertson
<ti...@gmail.com> wrote:
> No worries Usman,  I will try and do the same on Monday.
>
> Thanks Todd for the clarification.
>
> Tim
>
>
> On Fri, Oct 16, 2009 at 5:30 PM, Usman Waheed <us...@opera.com> wrote:
>> Hi Tim,
>>
>> I have been swamped with some other stuff so did not get a chance to run
>> further tests on my setup.
>> Will send them out early next week so we can compare.
>>
>> Cheers,
>> Usman
>>
>>> On Fri, Oct 16, 2009 at 4:01 AM, tim robertson
>>> <ti...@gmail.com>wrote:
>>>
>>>
>>>>
>>>> Hi all,
>>>>
>>>> Adding the following to core-site.xml, mapred-site.xml and
>>>> hdfs-site.xml (based on Cloudera guidelines:
>>>> http://tinyurl.com/ykupczu)
>>>>  io.sort.factor: 15  (mapred-site.xml)
>>>>  io.sort.mb: 150  (mapred-site.xml)
>>>>  io.file.buffer.size: 65536   (core-site.xml)
>>>>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
>>>> default)
>>>>
>>>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>>>>
>>>> Using 2 mappers and 2 reducers, can someone please help me with the
>>>> maths as to why my jobs are failing with "Error: Java heap space" in
>>>> the maps?
>>>> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>>>>
>>>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>>>> Plus the 2 daemons on the node at 1G each = 1.8G
>>>> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>>>>
>>>> The machines have 8G in them.  Obviously my maths is screwy somewhere...
>>>>
>>>>
>>>>
>>>
>>> Hi Tim,
>>>
>>> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE parameter
>>> is
>>> for the daemons, not the tasks. If you bump up io.sort.mb you also have to
>>> bump up the -Xmx argument in mapred.child.java.opts to give the actual
>>> tasks
>>> more RAM.
>>>
>>> -Todd
>>>
>>>
>>>
>>>>
>>>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>>>> Usman Waheed <us...@opera.com> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> Hi Todd,
>>>>>>
>>>>>> Some changes have been applied to the cluster based on the
>>>>>> documentation (URL) you noted below,
>>>>>>
>>>>>
>>>>> I would also like to know what settings people are tuning on the
>>>>> operating system level. The blog post mentioned here does not mention
>>>>> much about that, except for the fileno changes.
>>>>>
>>>>> We got about 3x the read performance when running DFSIOTest by mounting
>>>>> our ext3 filesystems with the noatime parameter. I saw that mentioned
>>>>> in the slides from some Cloudera presentation.
>>>>>
>>>>> (For those who don't know, the noatime parameter turns off the
>>>>> recording of access time on files. That's a horrible performance killer
>>>>> since it means every read of a file also means that the kernel must do
>>>>> a write. These writes are probably queued up, but still, if you don't
>>>>> need the atime (very few applications do), turn it off!)
>>>>>
>>>>> Have people been experimenting with different filesystems, or are most
>>>>> of us running on top of ext3?
>>>>>
>>>>> How about mounting ext3 with "data=writeback"? That's rumoured to give
>>>>> the best throughput and could help with write performance. From
>>>>> mount(8):
>>>>>
>>>>>    writeback
>>>>>           Data ordering is not preserved - data may be written into the
>>>>>
>>>>
>>>> main file system
>>>>
>>>>>
>>>>>           after its metadata has been  committed  to the journal.  This
>>>>>
>>>>
>>>> is rumoured to be the
>>>>
>>>>>
>>>>>           highest throughput option.  It guarantees internal file system
>>>>>
>>>>
>>>> integrity,
>>>>
>>>>>
>>>>>           however it can allow old data to appear in files after a crash
>>>>>
>>>>
>>>> and journal recovery.
>>>>
>>>>>
>>>>> How would the HDFS consistency checks cope with old data appearing in
>>>>> the underlying files after a system crash?
>>>>>
>>>>> Cheers,
>>>>> \EF
>>>>> --
>>>>> Erik Forsberg <fo...@opera.com>
>>>>> Developer, Opera Software - http://www.opera.com/
>>>>>
>>>>>
>>>
>>>
>>
>>
>

Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
No worries Usman,  I will try and do the same on Monday.

Thanks Todd for the clarification.

Tim


On Fri, Oct 16, 2009 at 5:30 PM, Usman Waheed <us...@opera.com> wrote:
> Hi Tim,
>
> I have been swamped with some other stuff so did not get a chance to run
> further tests on my setup.
> Will send them out early next week so we can compare.
>
> Cheers,
> Usman
>
>> On Fri, Oct 16, 2009 at 4:01 AM, tim robertson
>> <ti...@gmail.com>wrote:
>>
>>
>>>
>>> Hi all,
>>>
>>> Adding the following to core-site.xml, mapred-site.xml and
>>> hdfs-site.xml (based on Cloudera guidelines:
>>> http://tinyurl.com/ykupczu)
>>>  io.sort.factor: 15  (mapred-site.xml)
>>>  io.sort.mb: 150  (mapred-site.xml)
>>>  io.file.buffer.size: 65536   (core-site.xml)
>>>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
>>> default)
>>>
>>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>>>
>>> Using 2 mappers and 2 reducers, can someone please help me with the
>>> maths as to why my jobs are failing with "Error: Java heap space" in
>>> the maps?
>>> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>>>
>>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>>> Plus the 2 daemons on the node at 1G each = 1.8G
>>> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>>>
>>> The machines have 8G in them.  Obviously my maths is screwy somewhere...
>>>
>>>
>>>
>>
>> Hi Tim,
>>
>> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE parameter
>> is
>> for the daemons, not the tasks. If you bump up io.sort.mb you also have to
>> bump up the -Xmx argument in mapred.child.java.opts to give the actual
>> tasks
>> more RAM.
>>
>> -Todd
>>
>>
>>
>>>
>>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com>
>>> wrote:
>>>
>>>>
>>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>>> Usman Waheed <us...@opera.com> wrote:
>>>>
>>>>
>>>>>
>>>>> Hi Todd,
>>>>>
>>>>> Some changes have been applied to the cluster based on the
>>>>> documentation (URL) you noted below,
>>>>>
>>>>
>>>> I would also like to know what settings people are tuning on the
>>>> operating system level. The blog post mentioned here does not mention
>>>> much about that, except for the fileno changes.
>>>>
>>>> We got about 3x the read performance when running DFSIOTest by mounting
>>>> our ext3 filesystems with the noatime parameter. I saw that mentioned
>>>> in the slides from some Cloudera presentation.
>>>>
>>>> (For those who don't know, the noatime parameter turns off the
>>>> recording of access time on files. That's a horrible performance killer
>>>> since it means every read of a file also means that the kernel must do
>>>> a write. These writes are probably queued up, but still, if you don't
>>>> need the atime (very few applications do), turn it off!)
>>>>
>>>> Have people been experimenting with different filesystems, or are most
>>>> of us running on top of ext3?
>>>>
>>>> How about mounting ext3 with "data=writeback"? That's rumoured to give
>>>> the best throughput and could help with write performance. From
>>>> mount(8):
>>>>
>>>>    writeback
>>>>           Data ordering is not preserved - data may be written into the
>>>>
>>>
>>> main file system
>>>
>>>>
>>>>           after its metadata has been  committed  to the journal.  This
>>>>
>>>
>>> is rumoured to be the
>>>
>>>>
>>>>           highest throughput option.  It guarantees internal file system
>>>>
>>>
>>> integrity,
>>>
>>>>
>>>>           however it can allow old data to appear in files after a crash
>>>>
>>>
>>> and journal recovery.
>>>
>>>>
>>>> How would the HDFS consistency checks cope with old data appearing in
>>>> the underlying files after a system crash?
>>>>
>>>> Cheers,
>>>> \EF
>>>> --
>>>> Erik Forsberg <fo...@opera.com>
>>>> Developer, Opera Software - http://www.opera.com/
>>>>
>>>>
>>
>>
>
>

Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Hi Tim,

I have been swamped with some other stuff so did not get a chance to run 
further tests on my setup.
Will send them out early next week so we can compare.

Cheers,
Usman

> On Fri, Oct 16, 2009 at 4:01 AM, tim robertson <ti...@gmail.com>wrote:
>
>   
>> Hi all,
>>
>> Adding the following to core-site.xml, mapred-site.xml and
>> hdfs-site.xml (based on Cloudera guidelines:
>> http://tinyurl.com/ykupczu)
>>  io.sort.factor: 15  (mapred-site.xml)
>>  io.sort.mb: 150  (mapred-site.xml)
>>  io.file.buffer.size: 65536   (core-site.xml)
>>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
>> default)
>>
>> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>>
>> Using 2 mappers and 2 reducers, can someone please help me with the
>> maths as to why my jobs are failing with "Error: Java heap space" in
>> the maps?
>> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>>
>> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
>> Plus the 2 daemons on the node at 1G each = 1.8G
>> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>>
>> The machines have 8G in them.  Obviously my maths is screwy somewhere...
>>
>>
>>     
> Hi Tim,
>
> Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE parameter is
> for the daemons, not the tasks. If you bump up io.sort.mb you also have to
> bump up the -Xmx argument in mapred.child.java.opts to give the actual tasks
> more RAM.
>
> -Todd
>
>
>   
>> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com> wrote:
>>     
>>> On Thu, 15 Oct 2009 11:32:35 +0200
>>> Usman Waheed <us...@opera.com> wrote:
>>>
>>>       
>>>> Hi Todd,
>>>>
>>>> Some changes have been applied to the cluster based on the
>>>> documentation (URL) you noted below,
>>>>         
>>> I would also like to know what settings people are tuning on the
>>> operating system level. The blog post mentioned here does not mention
>>> much about that, except for the fileno changes.
>>>
>>> We got about 3x the read performance when running DFSIOTest by mounting
>>> our ext3 filesystems with the noatime parameter. I saw that mentioned
>>> in the slides from some Cloudera presentation.
>>>
>>> (For those who don't know, the noatime parameter turns off the
>>> recording of access time on files. That's a horrible performance killer
>>> since it means every read of a file also means that the kernel must do
>>> a write. These writes are probably queued up, but still, if you don't
>>> need the atime (very few applications do), turn it off!)
>>>
>>> Have people been experimenting with different filesystems, or are most
>>> of us running on top of ext3?
>>>
>>> How about mounting ext3 with "data=writeback"? That's rumoured to give
>>> the best throughput and could help with write performance. From
>>> mount(8):
>>>
>>>     writeback
>>>            Data ordering is not preserved - data may be written into the
>>>       
>> main file system
>>     
>>>            after its metadata has been  committed  to the journal.  This
>>>       
>> is rumoured to be the
>>     
>>>            highest throughput option.  It guarantees internal file system
>>>       
>> integrity,
>>     
>>>            however it can allow old data to appear in files after a crash
>>>       
>> and journal recovery.
>>     
>>> How would the HDFS consistency checks cope with old data appearing in
>>> the underlying files after a system crash?
>>>
>>> Cheers,
>>> \EF
>>> --
>>> Erik Forsberg <fo...@opera.com>
>>> Developer, Opera Software - http://www.opera.com/
>>>
>>>       
>
>   


Re: Hardware performance from HADOOP cluster

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Oct 16, 2009 at 4:01 AM, tim robertson <ti...@gmail.com>wrote:

> Hi all,
>
> Adding the following to core-site.xml, mapred-site.xml and
> hdfs-site.xml (based on Cloudera guidelines:
> http://tinyurl.com/ykupczu)
>  io.sort.factor: 15  (mapred-site.xml)
>  io.sort.mb: 150  (mapred-site.xml)
>  io.file.buffer.size: 65536   (core-site.xml)
>  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the
> default)
>
> and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)
>
> Using 2 mappers and 2 reducers, can someone please help me with the
> maths as to why my jobs are failing with "Error: Java heap space" in
> the maps?
> (the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)
>
> io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
> Plus the 2 daemons on the node at 1G each = 1.8G
> Plus Xmx of 1G for each hadoop daemon task = 5.8G
>
> The machines have 8G in them.  Obviously my maths is screwy somewhere...
>
>
Hi Tim,

Did you also change mapred.child.java.opts? The HADOOP_HEAPSIZE parameter is
for the daemons, not the tasks. If you bump up io.sort.mb you also have to
bump up the -Xmx argument in mapred.child.java.opts to give the actual tasks
more RAM.
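
A sketch of raising the pair together on a per-job basis (jar name and sizes are illustrative; assumes a 0.20-era examples jar and the generic -D option support):

```shell
# Raise the sort buffer and the task heap in step: io.sort.mb is allocated
# inside the child JVM's heap, so -Xmx must grow along with it.
hadoop jar hadoop-*-examples.jar sort \
  -D io.sort.mb=200 \
  -D mapred.child.java.opts=-Xmx512m \
  input output
```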

-Todd


>
>
> On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com> wrote:
> > On Thu, 15 Oct 2009 11:32:35 +0200
> > Usman Waheed <us...@opera.com> wrote:
> >
> >> Hi Todd,
> >>
> >> Some changes have been applied to the cluster based on the
> >> documentation (URL) you noted below,
> >
> > I would also like to know what settings people are tuning on the
> > operating system level. The blog post mentioned here does not mention
> > much about that, except for the fileno changes.
> >
> > We got about 3x the read performance when running DFSIOTest by mounting
> > our ext3 filesystems with the noatime parameter. I saw that mentioned
> > in the slides from some Cloudera presentation.
> >
> > (For those who don't know, the noatime parameter turns off the
> > recording of access time on files. That's a horrible performance killer
> > since it means every read of a file also means that the kernel must do
> > a write. These writes are probably queued up, but still, if you don't
> > need the atime (very few applications do), turn it off!)
> >
> > Have people been experimenting with different filesystems, or are most
> > of us running on top of ext3?
> >
> > How about mounting ext3 with "data=writeback"? That's rumoured to give
> > the best throughput and could help with write performance. From
> > mount(8):
> >
> >     writeback
> >            Data ordering is not preserved - data may be written into the
> main file system
> >            after its metadata has been  committed  to the journal.  This
> is rumoured to be the
> >            highest throughput option.  It guarantees internal file system
> integrity,
> >            however it can allow old data to appear in files after a crash
> and journal recovery.
> >
> > How would the HDFS consistency checks cope with old data appearing in
> > the underlying files after a system crash?
> >
> > Cheers,
> > \EF
> > --
> > Erik Forsberg <fo...@opera.com>
> > Developer, Opera Software - http://www.opera.com/
> >
>

Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Hi all,

Adding the following to core-site.xml, mapred-site.xml and
hdfs-site.xml (based on Cloudera guidelines:
http://tinyurl.com/ykupczu)
  io.sort.factor: 15  (mapred-site.xml)
  io.sort.mb: 150  (mapred-site.xml)
  io.file.buffer.size: 65536   (core-site.xml)
  dfs.datanode.handler.count: 3 (hdfs-site.xml  actually this is the default)

and using the default of HADOOP_HEAPSIZE=1000 (hadoop-env.sh)

Using 2 mappers and 2 reducers, can someone please help me with the
maths as to why my jobs are failing with "Error: Java heap space" in
the maps?
(the same runs fine with io.sort.factor of 10 and io.sort.mb of 100)

io.sort.mb of 200 x 4 (2 mappers, 2 reducers) = 0.8G
Plus the 2 daemons on the node at 1G each = 1.8G
Plus Xmx of 1G for each hadoop daemon task = 5.8G

The machines have 8G in them.  Obviously my maths is screwy somewhere...
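
A rough sanity check, assuming the sort buffer is carved out of each child JVM's heap rather than added on top of it (the 3/4 threshold below is an illustrative rule of thumb, not an official limit):

```shell
# Sketch: io.sort.mb lives inside the task JVM heap, so it must fit with
# headroom left over for the rest of the map task.
IO_SORT_MB=150
CHILD_XMX_MB=200   # the stock default for mapred.child.java.opts is -Xmx200m
if [ "$IO_SORT_MB" -ge $((CHILD_XMX_MB * 3 / 4)) ]; then
  echo "sort buffer too close to task heap: expect 'Error: Java heap space'"
fi
```

With these numbers the check fires, which matches the failures seen with io.sort.mb of 150 but not with 100.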

Cheers,
Tim



On Fri, Oct 16, 2009 at 9:59 AM, Erik Forsberg <fo...@opera.com> wrote:
> On Thu, 15 Oct 2009 11:32:35 +0200
> Usman Waheed <us...@opera.com> wrote:
>
>> Hi Todd,
>>
>> Some changes have been applied to the cluster based on the
>> documentation (URL) you noted below,
>
> I would also like to know what settings people are tuning on the
> operating system level. The blog post mentioned here does not mention
> much about that, except for the fileno changes.
>
> We got about 3x the read performance when running DFSIOTest by mounting
> our ext3 filesystems with the noatime parameter. I saw that mentioned
> in the slides from some Cloudera presentation.
>
> (For those who don't know, the noatime parameter turns off the
> recording of access time on files. That's a horrible performance killer
> since it means every read of a file also means that the kernel must do
> a write. These writes are probably queued up, but still, if you don't
> need the atime (very few applications do), turn it off!)
>
> Have people been experimenting with different filesystems, or are most
> of us running on top of ext3?
>
> How about mounting ext3 with "data=writeback"? That's rumoured to give
> the best throughput and could help with write performance. From
> mount(8):
>
>     writeback
>            Data ordering is not preserved - data may be written into the main file system
>            after its metadata has been  committed  to the journal.  This is rumoured to be the
>            highest throughput option.  It guarantees internal file system integrity,
>            however it can allow old data to appear in files after a crash and journal recovery.
>
> How would the HDFS consistency checks cope with old data appearing in
> the underlying files after a system crash?
>
> Cheers,
> \EF
> --
> Erik Forsberg <fo...@opera.com>
> Developer, Opera Software - http://www.opera.com/
>

Re: Hardware performance from HADOOP cluster

Posted by Erik Forsberg <fo...@opera.com>.
On Thu, 15 Oct 2009 11:32:35 +0200
Usman Waheed <us...@opera.com> wrote:

> Hi Todd,
> 
> Some changes have been applied to the cluster based on the
> documentation (URL) you noted below,

I would also like to know what settings people are tuning on the
operating system level. The blog post mentioned here does not mention
much about that, except for the fileno changes.

We got about 3x the read performance when running DFSIOTest by mounting
our ext3 filesystems with the noatime parameter. I saw that mentioned
in the slides from some Cloudera presentation.

(For those who don't know, the noatime parameter turns off the
recording of access time on files. That's a horrible performance killer
since it means every read of a file also means that the kernel must do
a write. These writes are probably queued up, but still, if you don't
need the atime (very few applications do), turn it off!)
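
A sketch of how that looks in practice (device name and mount point are placeholders for a Hadoop data disk):

```shell
# Illustrative /etc/fstab entry for an ext3 data disk mounted without atime:
#   /dev/sdb1  /data/1  ext3  defaults,noatime  0 0

# A live filesystem can be switched over without a reboot:
mount -o remount,noatime /data/1
```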

Have people been experimenting with different filesystems, or are most
of us running on top of ext3? 

How about mounting ext3 with "data=writeback"? That's rumoured to give
the best throughput and could help with write performance. From
mount(8):

     writeback
            Data ordering is not preserved - data may be written into the main file system 
            after its metadata has been  committed  to the journal.  This is rumoured to be the
            highest throughput option.  It guarantees internal file system integrity,  
            however it can allow old data to appear in files after a crash and journal recovery.

How would the HDFS consistency checks cope with old data appearing in
the underlying files after a system crash?

Cheers,
\EF
-- 
Erik Forsberg <fo...@opera.com>
Developer, Opera Software - http://www.opera.com/

Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Hi Todd,

Some changes have been applied to the cluster based on the documentation 
(URL) you noted below,
like file descriptor settings and io.file.buffer.size. I will check out 
the other settings which I haven't applied yet.

My map/reduce slot settings from my hadoop-site.xml and 
hadoop-default.xml on all nodes in the cluster.

hadoop-site.xml
mapred.tasktracker.task.maximum = 2
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

hadoop-default.xml
mapred.map.tasks = 2
mapred.reduce.tasks = 1

Thanks,
Usman


> This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
> you changed the configurations at all? There are some notes on this
> blog post that might help your performance a bit:
>
> http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
>
> How many map and reduce slots did you configure for the daemons? If
> you have Ganglia installed you can usually get a good idea of whether
> you're using your resources well by looking at the graphs while
> running a job like this sort.
>
> -Todd
>
> On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <us...@opera.com> wrote:
>   
>> Here are the results i got from my 4 node cluster (correction i noted 5
>> earlier). One of my nodes out of the 4 is a namenode+datanode both.
>>
>> GENERATE RANDOM DATA
>> Wrote out 40GB of random binary data:
>> Map output records=4088301
>> The job took 358 seconds. (approximately: 6 minutes).
>>
>> SORT RANDOM GENERATED DATA
>> Map output records=4088301
>> Reduce input records=4088301
>> The job took 2136 seconds. (approximately: 35 minutes).
>>
>> VALIDATION OF SORTED DATA
>> The job took 183 seconds.
>> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>>
>> It would be interesting to see what performance numbers others with a
>> similar setup have obtained.
>>
>> Thanks,
>> Usman
>>
>>     
>>> I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
>>> cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
>>> issues though so it won't start up...
>>>
>>> Tim
>>>
>>>
>>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
>>>
>>>       
>>>> Thanks Tim, i will check it out and post my results for comments.
>>>> -Usman
>>>>
>>>>         
>>>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>>>>> posting your results for comment?
>>>>>
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>>>>
>>>>>
>>>>>           
>>>>>> Hi,
>>>>>>
>>>>>> Is there a way to tell what kind of performance numbers one can expect
>>>>>> out
>>>>>> of their cluster given a certain set of specs.
>>>>>>
>>>>>> For example i have 5 nodes in my cluster that all have the following
>>>>>> hardware configuration(s):
>>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>>>>
>>>>>> Thanks,
>>>>>> Usman
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>             
>>>>>           
>>>>         
>>>       
>>     
>
>   


Re: Hardware performance from HADOOP cluster

Posted by Todd Lipcon <to...@cloudera.com>.
This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
you changed the configurations at all? There are some notes on this
blog post that might help your performance a bit:

http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

How many map and reduce slots did you configure for the daemons? If
you have Ganglia installed you can usually get a good idea of whether
you're using your resources well by looking at the graphs while
running a job like this sort.

-Todd

On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed <us...@opera.com> wrote:
> Here are the results i got from my 4 node cluster (correction i noted 5
> earlier). One of my nodes out of the 4 is a namenode+datanode both.
>
> GENERATE RANDOM DATA
> Wrote out 40GB of random binary data:
> Map output records=4088301
> The job took 358 seconds. (approximately: 6 minutes).
>
> SORT RANDOM GENERATED DATA
> Map output records=4088301
> Reduce input records=4088301
> The job took 2136 seconds. (approximately: 35 minutes).
>
> VALIDATION OF SORTED DATA
> The job took 183 seconds.
> SUCCESS! Validated the MapReduce framework's 'sort' successfully.
>
> It would be interesting to see what performance numbers others with a
> similar setup have obtained.
>
> Thanks,
> Usman
>
>> I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
>> cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
>> issues though so it won't start up...
>>
>> Tim
>>
>>
>> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
>>
>>>
>>> Thanks Tim, i will check it out and post my results for comments.
>>> -Usman
>>>
>>>>
>>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>>>> posting your results for comment?
>>>>
>>>> Tim
>>>>
>>>>
>>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>>>
>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> Is there a way to tell what kind of performance numbers one can expect
>>>>> out
>>>>> of their cluster given a certain set of specs.
>>>>>
>>>>> For example i have 5 nodes in my cluster that all have the following
>>>>> hardware configuration(s):
>>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>>>
>>>>> Thanks,
>>>>> Usman
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Here are the results I got from my 4-node cluster (correction: I noted 5 
earlier). One of the 4 nodes serves as both namenode and datanode.

GENERATE RANDOM DATA
Wrote out 40GB of random binary data:
Map output records=4088301
The job took 358 seconds (approximately 6 minutes).

SORT RANDOM GENERATED DATA
Map output records=4088301
Reduce input records=4088301
The job took 2136 seconds (approximately 36 minutes).

VALIDATION OF SORTED DATA
The job took 183 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

It would be interesting to see what performance numbers others with a 
similar setup have obtained.
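For a rough sanity check, the per-node throughput implied by these numbers can be computed as follows (a sketch, assuming 40 GB is about 40,960 MB and that all 4 nodes did equal work):

```python
# Rough per-node throughput implied by the benchmark runs above.
# Assumes 40 GB ~= 40,960 MB and equal work across all 4 nodes.
DATA_MB = 40 * 1024
NODES = 4

def mb_per_sec_per_node(seconds):
    """Average MB/s each node processed during the job."""
    return DATA_MB / NODES / seconds

gen_rate = mb_per_sec_per_node(358)    # random-data generation
sort_rate = mb_per_sec_per_node(2136)  # full sort

print("generate: %.1f MB/s per node" % gen_rate)   # ~28.6
print("sort:     %.1f MB/s per node" % sort_rate)  # ~4.8
```

The sort rate of roughly 4.8 MB/s per node is much lower than the write rate, which is expected since the sort phase also pays for shuffle and merge I/O, but it is worth checking whether tuning could narrow the gap.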

Thanks,
Usman

> I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
> cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
> issues though so it won't start up...
>
> Tim
>
>
> On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
>   
>> Thanks Tim, i will check it out and post my results for comments.
>> -Usman
>>     
>>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>>> posting your results for comment?
>>>
>>> Tim
>>>
>>>
>>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>>
>>>       
>>>> Hi,
>>>>
>>>> Is there a way to tell what kind of performance numbers one can expect
>>>> out
>>>> of their cluster given a certain set of specs.
>>>>
>>>> For example i have 5 nodes in my cluster that all have the following
>>>> hardware configuration(s):
>>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>>
>>>> Thanks,
>>>> Usman
>>>>
>>>>
>>>>
>>>>         
>>>       
>>     
>
>   


Re: Hardware performance from HADOOP cluster

Posted by sudha sadhasivam <su...@yahoo.com>.
You can use the terabyte sort to check performance.
G Sudha Sadasivam

--- On Wed, 10/14/09, tim robertson <ti...@gmail.com> wrote:


From: tim robertson <ti...@gmail.com>
Subject: Re: Hardware performance from HADOOP cluster
To: common-user@hadoop.apache.org
Date: Wednesday, October 14, 2009, 3:16 PM


I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
issues though so it won't start up...

Tim


On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
> Thanks Tim, i will check it out and post my results for comments.
> -Usman
>>
>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>> posting your results for comment?
>>
>> Tim
>>
>>
>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> Is there a way to tell what kind of performance numbers one can expect
>>> out
>>> of their cluster given a certain set of specs.
>>>
>>> For example i have 5 nodes in my cluster that all have the following
>>> hardware configuration(s):
>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>
>>> Thanks,
>>> Usman
>>>
>>>
>>>
>>
>>
>
>




Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
I am setting up a new cluster of 10 nodes, each with a 2.83GHz quad-core
(2x6MB cache), 8GB RAM, and 2x500GB drives, and will do the same soon.
I have some issues, though, so it won't start up yet...

Tim


On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed <us...@opera.com> wrote:
> Thanks Tim, i will check it out and post my results for comments.
> -Usman
>>
>> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
>> posting your results for comment?
>>
>> Tim
>>
>>
>> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>>
>>>
>>> Hi,
>>>
>>> Is there a way to tell what kind of performance numbers one can expect
>>> out
>>> of their cluster given a certain set of specs.
>>>
>>> For example i have 5 nodes in my cluster that all have the following
>>> hardware configuration(s):
>>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>>
>>> Thanks,
>>> Usman
>>>
>>>
>>>
>>
>>
>
>

Re: Hardware performance from HADOOP cluster

Posted by Usman Waheed <us...@opera.com>.
Thanks Tim, I will check it out and post my results for comments.
-Usman
> Might it be worth running the http://wiki.apache.org/hadoop/Sort and
> posting your results for comment?
>
> Tim
>
>
> On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
>   
>> Hi,
>>
>> Is there a way to tell what kind of performance numbers one can expect out
>> of their cluster given a certain set of specs.
>>
>> For example i have 5 nodes in my cluster that all have the following
>> hardware configuration(s):
>> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>>
>> Thanks,
>> Usman
>>
>>
>>     
>
>   


Re: Hardware performance from HADOOP cluster

Posted by tim robertson <ti...@gmail.com>.
Might it be worth running the sort benchmark (http://wiki.apache.org/hadoop/Sort) and
posting your results for comment?

Tim


On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed <us...@opera.com> wrote:
> Hi,
>
> Is there a way to tell what kind of performance numbers one can expect out
> of their cluster given a certain set of specs.
>
> For example i have 5 nodes in my cluster that all have the following
> hardware configuration(s):
> Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.
>
> Thanks,
> Usman
>
>