Posted to common-user@hadoop.apache.org by Sandeep Dhawan <ds...@hcl.in> on 2008/12/30 12:57:34 UTC

Performance testing

Hi,

I am trying to create a Hadoop cluster that can handle 2000 write requests
per second.
In each write request I would be writing a line of size 1KB to a file.

I would be using machines with the following configuration:
Platform: Red Hat Linux 9.0 
CPU : 2.07 GHz
RAM : 1GB

Can anyone give me some pointers/guidelines on how to go about
setting up such a cluster?
What configuration parameters in Hadoop can we tweak to
enhance the performance of the cluster?

Thanks,
Sandeep


Re: Performance testing

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
I should add that your test should both create and delete files.

Raghu.

Raghu Angadi wrote:
> This is essentially a matter of deciding how many datanodes (with the
> given configuration) you need to write 3*2000*2 files per second
> (assuming each 1KB write is a separate HDFS file).


Re: Performance testing

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Sandeep Dhawan wrote:
> Hi,
> 
> I am trying to create a Hadoop cluster that can handle 2000 write requests
> per second.
> In each write request I would be writing a line of size 1KB to a file.

This is essentially a matter of deciding how many datanodes (with the 
given configuration) you need to write 3*2000*2 files per second 
(assuming each 1KB write is a separate HDFS file).

You can test this on a single datanode. For example, if your datanode 
supports 1,000 1KB file creations per second (even with multiple processes 
creating files at the same time), then you need 12 datanodes (plus any 
factor of safety you want to add).
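
To make that arithmetic concrete, here is a minimal sketch in Java.
Reading the 3 as the default replication factor and the 2 as an allowance
for the matching deletes is my interpretation, not something stated above:

public class DatanodeSizing {
    public static void main(String[] args) {
        int replication = 3;           // default dfs.replication
        int createsPerSec = 2000;      // target write requests per second
        int opsFactor = 2;             // rough allowance for matching deletes
        int perNodeFilesPerSec = 1000; // measured on one datanode beforehand

        int totalOpsPerSec = replication * createsPerSec * opsFactor; // 12000
        int nodes = (int) Math.ceil((double) totalOpsPerSec / perNodeFilesPerSec);
        System.out.println("Datanodes needed, before safety margin: " + nodes); // 12
    }
}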

How many nodes or disks do you have approximately?

Raghu.


> I would be using machines with the following configuration:
> Platform: Red Hat Linux 9.0 
> CPU : 2.07 GHz
> RAM : 1GB
> 
> Can anyone give me some pointers/guidelines on how to go about
> setting up such a cluster?
> What configuration parameters in Hadoop can we tweak to
> enhance the performance of the cluster?
> 
> Thanks,
> Sandeep


Re: Performance testing

Posted by Jothi Padmanabhan <jo...@yahoo-inc.com>.
Hi, see answers inline below

HTH,
Jothi

> I would like to know:
> 
> 1. How does block size impact the performance of a mapred job?

From the M/R side, the file system block size of the input files is treated
as an upper bound for input splits. Since each input split translates into
one map, this can affect the actual number of maps for the job.
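
For example, here is a rough sketch of the split-size computation as I read
it from the old FileInputFormat in that release (the input size, block size,
and map-count hint below are hypothetical):

public class SplitMath {
    // Mirrors FileInputFormat.computeSplitSize(goalSize, minSize, blockSize):
    // the block size caps the split size unless the per-job goal is smaller.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long totalSize = 10L * 1024 * 1024 * 1024; // 10 GB of input
        long blockSize = 64L * 1024 * 1024;        // dfs.block.size of 64 MB
        long goalSize  = totalSize / 10;           // mapred.map.tasks hint of 10
        long splitSize = computeSplitSize(goalSize, 1, blockSize);
        System.out.println("split = " + splitSize + " bytes, maps ~ "
                + (totalSize / splitSize));        // 64 MB splits -> ~160 maps
    }
}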

> 2. Does the performance improve if I set up the NameNode and JobTracker on
> different machines? At present,
> I am running the NameNode and JobTracker on the same master machine,
> connected to 2 slave machines running the DataNode and TaskTracker.

Intuitively, it should help. The NameNode is really memory-intensive, and the
JobTracker could also be heavily loaded, depending on the number of concurrent
jobs running and the number of maps and reduces of those jobs (for
scheduling).
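
If you try it, the two daemons are located through separate config keys, so
they can point at separate hosts. A sketch (the hostnames are hypothetical,
and these settings would normally live in hadoop-site.xml rather than code):

import org.apache.hadoop.conf.Configuration;

public class SeparateMasters {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // HDFS master (NameNode) on one machine...
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        // ...and the M/R master (JobTracker) on another.
        conf.set("mapred.job.tracker", "jobtracker-host:9001");
    }
}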

> 3. What should the replication factor be for a 3-node cluster?

I think a higher replication factor will not increase performance on a
3-node cluster; if anything, it might degrade performance because of the
extra replication. If replication is only for performance and not for
availability/fault tolerance, you could try setting the replication factor
to a smaller number (1?).
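
A sketch of both knobs (the file path is hypothetical): dfs.replication
applies to files created under that configuration, while
FileSystem.setReplication changes a file that already exists.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowerReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "1");  // new files get one replica
        FileSystem fs = FileSystem.get(conf);
        // Adjust a file that already exists:
        fs.setReplication(new Path("/data/writes-00001.txt"), (short) 1);
        fs.close();
    }
}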

> 4. How does io.sort.mb impact the performance of the cluster?

Look here
http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html
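
In short, io.sort.mb sizes the in-memory buffer each map task sorts its
output in before spilling to disk; fewer spills usually means faster maps.
A sketch (the 100 MB default and the heap caveat are from my memory of that
doc, so verify against the tutorial above):

import org.apache.hadoop.mapred.JobConf;

public class SortBuffer {
    public static void main(String[] args) {
        JobConf job = new JobConf();
        job.set("io.sort.mb", "200");  // default is 100 (MB)
        // The buffer lives inside the task JVM, so grow the heap with it:
        job.set("mapred.child.java.opts", "-Xmx512m");
    }
}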

> 
> Thanks,
> Sandeep 


Re: Performance testing

Posted by Sandeep Dhawan <ds...@hcl.in>.
Hi,

I am in the process of following your guidelines. 

I would like to know:

1. How does block size impact the performance of a mapred job?
2. Does the performance improve if I set up the NameNode and JobTracker on
different machines? At present,
I am running the NameNode and JobTracker on the same master machine,
connected to 2 slave machines running the DataNode and TaskTracker.
3. What should the replication factor be for a 3-node cluster?
4. How does io.sort.mb impact the performance of the cluster?

Thanks,
Sandeep 


Brian Bockelman wrote:
> 
> Hey Sandeep,
> 
> I'd do a couple of things:
> 1) Run your test.  Do something similar to your actual
> workflow.
> 2) Save the resulting Ganglia plots.  This will give you a hint as to  
> where things are bottlenecking (memory, CPU, wait I/O).
> 3) Watch iostat and find out the I/O rates during the test.  Compare
> this to the I/O rates of a known I/O benchmark (e.g., Bonnie++).
> 4) Finally, watch the logfiles closely.  If you start to overload  
> things, you'll usually get a pretty good indication from Hadoop where  
> things go wrong.  Once something does go wrong, *then* look through  
> the parameters to see what can be done.
> 
> There's about a hundred things which can go wrong between the kernel,  
> the OS, Java, and the application code.  It's difficult to make an  
> educated guess beforehand without some hint from the data.
> 
> Brian


Re: Performance testing

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Sandeep,

I'd do a couple of things:
1) Run your test.  Do something similar to your actual
workflow (a rough load-generator sketch follows this list).
2) Save the resulting Ganglia plots.  This will give you a hint as to  
where things are bottlenecking (memory, CPU, wait I/O).
3) Watch iostat and find out the I/O rates during the test.  Compare 
this to the I/O rates of a known I/O benchmark (e.g., Bonnie++).
4) Finally, watch the logfiles closely.  If you start to overload  
things, you'll usually get a pretty good indication from Hadoop where  
things go wrong.  Once something does go wrong, *then* look through  
the parameters to see what can be done.
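
For step 1, a minimal load-generator sketch against the HDFS API (the file
count, the /bench path, and the single-threaded loop are all placeholders;
run several copies in parallel to approach 2000 requests/sec):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TinyFileWriteTest {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] line = new byte[1024];              // the 1KB payload
        int files = 10000;                         // hypothetical test size
        long start = System.currentTimeMillis();
        for (int i = 0; i < files; i++) {
            FSDataOutputStream out = fs.create(new Path("/bench/f-" + i));
            out.write(line);
            out.close();
        }
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("%d files in %.1fs = %.0f creates/sec%n",
                files, secs, files / secs);
        fs.delete(new Path("/bench"), true);       // create *and* delete
    }
}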

There's about a hundred things which can go wrong between the kernel,  
the OS, Java, and the application code.  It's difficult to make an  
educated guess beforehand without some hint from the data.

Brian

On Dec 31, 2008, at 1:30 AM, Sandeep Dhawan wrote:

>
> Hi Brian,
>
> That is exactly my issue, i.e., "How do I ascertain the bottleneck?" In
> other words, if the results obtained from the performance testing are
> not up to the mark, how do I find the bottleneck?
>
> How can we confidently say that the OS and hardware are the culprits? I
> understand that using the latest OS and hardware can improve the
> performance irrespective of the application, but my real worry is
> "what next?". How can I further increase the performance? What should I
> look for that can suggest or point to the areas which can be potential
> problems or "hotspots"?
>
> Thanks for your comments.
>
> ~Sandeep~


Re: Performance testing

Posted by Sandeep Dhawan <ds...@hcl.in>.
Hi Brian,

That is exactly my issue, i.e., "How do I ascertain the bottleneck?" In other
words, if the results obtained from the performance testing are not up to the
mark, how do I find the bottleneck?

How can we confidently say that the OS and hardware are the culprits? I
understand that using the latest OS and hardware can improve the
performance irrespective of the application, but my real worry is "what
next?". How can I further increase the performance? What should I look for
that can suggest or point to the areas which can be potential problems or
"hotspots"?

Thanks for your comments.

~Sandeep~


Brian Bockelman wrote:
> 
> Hey Sandeep,
> 
> I would warn against premature optimization: first, run your test,  
> then see how far from your target you are.
> 
> Of course, I'd wager you'd find that the hardware you are using is  
> woefully underpowered and that your OS is 5 years old.
> 
> Brian


Re: Performance testing

Posted by Brian Bockelman <bb...@cse.unl.edu>.
Hey Sandeep,

I would warn against premature optimization: first, run your test,  
then see how far from your target you are.

Of course, I'd wager you'd find that the hardware you are using is  
woefully underpowered and that your OS is 5 years old.

Brian

On Dec 30, 2008, at 5:57 AM, Sandeep Dhawan wrote:

>
> Hi,
>
> I am trying to create a Hadoop cluster that can handle 2000 write  
> requests
> per second.
> In each write request I would be writing a line of size 1KB to a file.
>
> I would be using machines with the following configuration:
> Platform: Red Hat Linux 9.0
> CPU : 2.07 GHz
> RAM : 1GB
>
> Can anyone give me some pointers/guidelines on how to go about
> setting up such a cluster?
> What configuration parameters in Hadoop can we tweak to
> enhance the performance of the cluster?
>
> Thanks,
> Sandeep