Posted to hdfs-user@hadoop.apache.org by Rohan Rai <ro...@inmobi.com> on 2010/05/14 10:23:10 UTC

HDFS Read ThroughPut and DISK Read ThroughPut

Hi

Is there a relationship between HDFS read throughput and disk read
throughput?

If so, what is it?

Let's say we have a disk giving us 120 MB/s,

a cluster of 6 nodes,

and each node having 6 disks.

So in an absolutely ideal world it should give us a throughput
of 120*6*6 MB/s if all disks are used in parallel.
In a non-ideal world we can divide the above by some factor x.
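
The ideal-case arithmetic above can be written out as a small sketch. The figures (120 MB/s per disk, 6 nodes, 6 disks per node, about 90 MB/s observed) come from this message; the function names are made up for illustration, and the "factor x" is treated as a plain divisor, which is an assumption about what is meant.

```python
# Figures quoted in the message.
DISK_MBPS = 120        # raw sequential read rate of one disk
NODES = 6
DISKS_PER_NODE = 6
OBSERVED_MBPS = 90     # roughly what the cluster is reported to deliver


def ideal_aggregate_mbps(disk_mbps, nodes, disks_per_node):
    """Upper bound if every disk streams at full speed in parallel."""
    return disk_mbps * nodes * disks_per_node


def derated_mbps(ideal_mbps, x):
    """Non-ideal estimate: the ideal figure divided by an overhead factor x."""
    if x <= 0:
        raise ValueError("x must be positive")
    return ideal_mbps / x


ideal = ideal_aggregate_mbps(DISK_MBPS, NODES, DISKS_PER_NODE)
print(ideal)  # 4320

# The factor x that would be needed to explain the observed number.
implied_x = ideal / OBSERVED_MBPS
print(implied_x)  # 48.0
```

So the question amounts to asking where a factor of about 48 goes, or whether the 90 MB/s figure measures the same thing as the 4320 MB/s figure at all.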

Then why is the overall cluster read throughput so much lower?

Generally it hovers around 90 MB/s.

How is the throughput that the cluster provides accounted for?

Just for information, the configuration is: 8 GB RAM, 250 GB HDD, 8 maps per
node, 128 MB block size.


Regards
Rohan




The information contained in this communication is intended solely for the use of the individual or entity to whom it is addressed and others authorized to receive it. It may contain confidential or legally privileged information. If you are not the intended recipient you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this information is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system. The firm is neither liable for the proper and complete transmission of the information contained in this communication nor for any delay in its receipt.

Re: HDFS Read ThroughPut and DISK Read ThroughPut

Posted by st...@yahoo.com.

As a random but related aside, there's a nice entry on Yahoo's Hadoop developer blog. It's all about throughput, scaling your cluster, and so on.

Take care,
-stu


Re: HDFS Read ThroughPut and DISK Read ThroughPut

Posted by Rohan Rai <ro...@inmobi.com>.

As an addendum: the cluster runs at most 44 maps at a time.

Regards
Rohan


Re: HDFS Read ThroughPut and DISK Read ThroughPut

Posted by Rohan Rai <ro...@inmobi.com>.

Hi Todd

Each node has multiple disks (7, to be precise), and there are 6 data
nodes.

The measurement used is the one provided by TestDFSIO, which comes with
hadoop*test.jar.

With the block size set to 128 MB,
44 files of 120 MB each were written, giving a throughput of 90 MB/s.

44 files of 100 MB each gave a read throughput of 30 MB/s.

The problem is: how do I interpret these numbers and compare them with raw
disk read throughput?
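
One way to frame the comparison is sketched below. It rests on an assumption that should be checked against the TestDFSIO source for the version in use: that the reported "Throughput mb/sec" is total bytes divided by the sum of the per-task times, which makes it a per-map-task average rather than a cluster-wide total. It also assumes all 44 files are read concurrently, one per map. The function names are hypothetical.

```python
# Figures from this thread.
READ_MBPS = 30         # read throughput reported by TestDFSIO
CONCURRENT_MAPS = 44   # one map per file, all assumed to run at once
NODES = 6
DISKS_PER_NODE = 7


def aggregate_estimate_mbps(per_task_mbps, concurrent_tasks):
    """Cluster-wide rate if the reported figure is a per-task average."""
    return per_task_mbps * concurrent_tasks


def per_disk_mbps(aggregate_mbps, nodes, disks_per_node):
    """Share of the aggregate served by each disk, assuming an even spread."""
    return aggregate_mbps / (nodes * disks_per_node)


# Reading 1: the 30 MB/s is already the cluster-wide total.
as_total = per_disk_mbps(READ_MBPS, NODES, DISKS_PER_NODE)

# Reading 2 (assumption): the 30 MB/s is a per-task average.
as_per_task = per_disk_mbps(
    aggregate_estimate_mbps(READ_MBPS, CONCURRENT_MAPS), NODES, DISKS_PER_NODE
)

print(round(as_total, 2))     # 0.71
print(round(as_per_task, 2))  # 31.43
```

Either per-disk figure can then be set against the raw rate of a single disk measured outside HDFS. Which reading applies depends on how the tool computes its number, so that is the first thing to confirm.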

Regards
Rohan


Re: HDFS Read ThroughPut and DISK Read ThroughPut

Posted by Todd Lipcon <to...@cloudera.com>.

Hi Rohan,

How are you measuring throughput? The throughput from a single client will
not scale up as the cluster size increases, as it does not parallelize reads
across multiple nodes. Of course it will also be limited by the inbound
bandwidth of that node.
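
As a rough sketch of that point: for a single client reading a block from a remote datanode, the achievable rate is bounded by the slowest stage on the path, and the number of nodes in the cluster does not appear in the bound at all. The 120 MB/s disk figure is from this thread; the 1 Gbit/s client link is purely a hypothetical value chosen for illustration, and protocol overhead is ignored.

```python
DISK_MBPS = 120               # raw disk rate quoted in the thread
GIGABIT_NIC_MBPS = 1000 / 8   # hypothetical 1 Gbit/s link, i.e. 125 MB/s


def single_client_bound_mbps(source_disk_mbps, client_nic_mbps):
    """Upper bound for one client streaming one block from one remote disk.

    The read is served by one datanode at a time, so the bound is the
    slower of the source disk and the client's inbound link.
    """
    return min(source_disk_mbps, client_nic_mbps)


# The bound is the same whether the cluster has 6 nodes or 600.
bound = single_client_bound_mbps(DISK_MBPS, GIGABIT_NIC_MBPS)
print(bound)  # 120
```

Aggregate throughput only grows with cluster size when many clients (for example, many map tasks) read in parallel from different nodes.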

-Todd



-- 
Todd Lipcon
Software Engineer, Cloudera