Posted to hdfs-user@hadoop.apache.org by Dibyendu Karmakar <di...@gmail.com> on 2013/01/23 10:24:05 UTC

Understanding harpoon - help needed

Hi,
I am doing some performance testing on Hadoop, and while testing I ran into
a question that I need your help with.

My HADOOP cluster :
6 Datanodes, 1 Namenode, 2 Clients.

Replication factor = 3

The 2 clients write a file (put operation) whose size is 2 x block size.
The capacity of DFS.DATA.DIR is equal on every Datanode and is the same as
the block size. That means each Datanode stores a single block.

Now, if the 2 clients simultaneously read the file (get operation),
will they read the 2 blocks from different Datanodes?
Or will they read from the same Datanodes?

Does Namenode know which Datanode is busy and which one is idle?

What I am trying to find is that...
Is it possible to decrease the read time by increasing replication factor?

I have attached an image to help explain my question. Kindly take a
look. Thank you. And if possible, please give references.

Re: Understanding harpoon - help needed

Posted by Harsh J <ha...@cloudera.com>.
Building on Bharath's responses, my answers to each of your questions are
inline:


On Wed, Jan 23, 2013 at 2:54 PM, Dibyendu Karmakar
<di...@gmail.com> wrote:

> Hi,
> I am doing some performance testing in HADOOP. But while testing, I faced
> a situation. I need your help.
>
> My HADOOP cluster :
> 6 Datanodes, 1 Namenode, 2 Clients.
>
> Replication factor = 3
>
> 2 clients write a file(put operation) whose size is 2 x block size.
> DFS.DATA.DIR in each Datanodes is equal and is same as block size. That
> means each Datanodes stores a single block.
>

This isn't strictly correct (block placement involves some randomness and
depends on the client's location), but let's assume a balanced state for
the sake of the question.


> Now, if 2 clients simultaneously reads the file( get operation),
> Will 2 clients read 2 blocks from different Datanodes ?
> Or they will read from the same datanodes?
>

Depends on where the client is located. If it's among the DNs, a local
read is done. If elsewhere, each client may read from a unique DN or even
the same DN (the NN returns the replica locations in random order). But
ideally the DN closest to the client is picked, at least rack-wise, if the
NN is aware of the topology.
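
To make the ordering described above concrete, here is a minimal Python
sketch. It is a simplified model and not the actual NameNode code: the
function names, the dict layout for hosts and racks, and the distance
values (0 for the same host, 2 for the same rack, 4 for another rack) are
assumptions chosen for illustration.

```python
import random

def distance(client, datanode):
    """Simplified network distance (assumed values): 0 same host, 2 same rack, 4 off-rack."""
    if client["host"] == datanode["host"]:
        return 0
    if client["rack"] == datanode["rack"]:
        return 2
    return 4

def order_replicas(client, replicas, rng=random):
    """Return the replicas nearest-first; replicas at equal distance come out in random order."""
    shuffled = list(replicas)
    rng.shuffle(shuffled)  # shuffle first so that ties are not always broken the same way
    # sorted() is stable, so the shuffled order survives among equal distances
    return sorted(shuffled, key=lambda dn: distance(client, dn))
```

A client running on one of the DNs gets its local replica first. Two remote
clients that are equally far from every replica each get an independently
shuffled list, so they may or may not start with the same DN.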

> Does Namenode know which Datanode is busy and which one is idle?
>

The NN does health checks upon writes (things like space, load and recent
availability). At read time, it is more the client that fails over: it
tries the DNs one at a time, in the order provided, until one accepts its
request, which is what happens if they are all highly busy.
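
The read-time failover described above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the HDFS
client implementation: `read_block` and the `try_read` callback are
hypothetical names, and a refused or failed request is modelled as an
IOError.

```python
def read_block(ordered_datanodes, try_read):
    """Try each DataNode in the given order and return (datanode, data) from the first success.

    `try_read` is a hypothetical callback that returns the block data or
    raises IOError when the DataNode refuses or fails the request.
    """
    errors = []
    for dn in ordered_datanodes:
        try:
            return dn, try_read(dn)
        except IOError as exc:
            errors.append((dn, exc))  # remember the failure and move on to the next replica
    raise IOError(f"all replicas failed: {errors}")
```

With a replication factor of 3 the client has up to three DNs to try before
the read fails, which is one way extra replicas help a read.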


> What I am trying to find is that...
> Is it possible to decrease the read time by increasing replication factor?
>

Yes, more replicas generally mean more DNs available to serve a read, but
at the same time it slows down writes, as there is more synchronous waiting
to take care of.
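
As a rough worked example of the read side, assume both clients are remote,
equally distant from every replica, and each picks its first DN uniformly
at random and independently. Under those assumptions (which are a
simplification, and the function name is hypothetical) the chance that both
clients hit the same DN for a given block is 1 divided by the replication
factor:

```python
from fractions import Fraction
from itertools import product

def same_datanode_probability(replication):
    """Exact chance that two independent, uniform picks among `replication` replicas coincide."""
    picks = list(product(range(replication), repeat=2))  # every (client A, client B) choice
    same = sum(1 for a, b in picks if a == b)
    return Fraction(same, len(picks))
```

For a replication factor of 3 this gives 1/3, and for 6 it gives 1/6, so
under this model a higher replication factor makes it less likely that the
two readers contend for the same DN.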


> I have attached an image to better understand my question. Kindly take a
> look. Thank you. And if possible please give references.
>
"Will access be distributed [for a series of blocks of the same file]?" -
Yes, for remote client reads. Access order is randomized for this kind of
client, possibly leading to a different pattern each time.

-- 
Harsh J

Re: Understanding harpoon - help needed

Posted by bharath vissapragada <bh...@gmail.com>.
Hi,

Link [1] partly answers your question. The Namenode chooses the "nearest"
data-node that can cater to this request. So replication definitely helps,
in the sense that a replica might be placed on a node nearer to the client.
I'm not sure whether the namenode checks if a datanode is busy serving
other requests, so I'll leave that for others to answer.

[1] http://hadoop.apache.org/docs/r0.20.2/hdfs_design.html#Replica+Selection

Thanks,
Bharath

