Posted to common-user@hadoop.apache.org by Jeff Eastman <je...@collab.net> on 2007/12/20 16:15:25 UTC
DFS Block Allocation
I've brought up a small cluster and uploaded some large files. The
master node is cu027 and it seems to be getting an unfair share of the
blocks, especially compared to cu171, which has the same size disk. Can
somebody shed some light on the reasons for this?
Jeff
Node    Last Contact  Admin State  Size (GB)  Used (%)  Blocks
cu009   0             In Service   69.22      75.07     81
cu027   1             In Service   291.24     59.74     333
cu028   1             In Service   71.19      35.34     76
cu034   1             In Service   71.19      17.79     81
cu035   0             In Service   71.19      18.49     64
cu045   1             In Service   71.19      16.68     70
cu050   2             In Service   71.19      31.99     69
cu062   2             In Service   71.19      39.64     72
cu063   2             In Service   71.19      25.03     77
cu171   2             In Service   291.24     13.05     79
Re: DFS Block Allocation
Posted by C G <pa...@yahoo.com>.
I am indeed a happy man... our data acquisition systems can see and interact with the compute grid proper, so configuring nodes outside the grid to speak with HDFS should be reasonably straightforward.
C G
Ted Dunning <td...@veoh.com> wrote:
On 12/20/07 5:52 PM, "C G" wrote:
> Ted, when you say "copy in the distro" do you need to include the
> configuration files from the running grid? You don't need to actually start
> HDFS on this node do you?
You are correct. You only need the config files (and the hadoop script
helps make things easier).
> If I'm following this approach correctly, I would want to have an "xfer
> server" whose job it is to essentially run dfs -copyFromLocal on all
> inbound-to-HDFS data. Once I'm certain that my data has copied correctly, I
> can delete the local files on the xfer server.
Yes.
> This is great news, as my current system wastes a lot of time copying data
> from data acquisition servers to the master node. If I can copy to HDFS
> directly from my acquisition servers then I am a happy guy....
You are a happy guy.
If your acquisition systems can see all of your datanodes.
Re: DFS Block Allocation
Posted by Ted Dunning <td...@veoh.com>.
On 12/20/07 5:52 PM, "C G" <pa...@yahoo.com> wrote:
> Ted, when you say "copy in the distro" do you need to include the
> configuration files from the running grid? You don't need to actually start
> HDFS on this node do you?
You are correct. You only need the config files (and the hadoop script
helps make things easier).
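For example, a minimal sketch (the namenode host and port below are
placeholders, not your actual settings): conf/hadoop-site.xml on the
outside machine only needs to name the namenode, e.g.

    <configuration>
      <property>
        <name>fs.default.name</name>
        <!-- host:port of the running namenode; substitute your own -->
        <value>namenode-host:9000</value>
      </property>
    </configuration>

After that, bin/hadoop dfs commands issued from that machine go straight
to the cluster's HDFS.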
> If I'm following this approach correctly, I would want to have an "xfer
> server" whose job it is to essentially run dfs -copyFromLocal on all
> inbound-to-HDFS data. Once I'm certain that my data has copied correctly, I
> can delete the local files on the xfer server.
Yes.
> This is great news, as my current system wastes a lot of time copying data
> from data acquisition servers to the master node. If I can copy to HDFS
> directly from my acquisition servers then I am a happy guy....
You are a happy guy.
If your acquisition systems can see all of your datanodes.
Re: DFS Block Allocation
Posted by C G <pa...@yahoo.com>.
Hmmm....this thread is very interesting - I didn't know most of the stuff mentioned here.
Ted, when you say "copy in the distro" do you need to include the configuration files from the running grid? You don't need to actually start HDFS on this node do you?
If I'm following this approach correctly, I would want to have an "xfer server" whose job it is to essentially run dfs -copyFromLocal on all inbound-to-HDFS data. Once I'm certain that my data has copied correctly, I can delete the local files on the xfer server.
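For instance (the paths here are just placeholders), the xfer server would run something along the lines of:

    bin/hadoop dfs -copyFromLocal /staging/acq-20071220.dat /user/data/acq-20071220.dat
    bin/hadoop dfs -ls /user/data

and delete /staging/acq-20071220.dat only after the listing confirms the file landed in HDFS.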
This is great news, as my current system wastes a lot of time copying data from data acquisition servers to the master node. If I can copy to HDFS directly from my acquisition servers then I am a happy guy....
Thanks,
C G
Ted Dunning <td...@veoh.com> wrote:
Just copy the hadoop distro directory to the other machine and use whatever
command you were using before.
A program that uses hadoop just has to have access to all of the nodes
across the net. It doesn't assume anything else.
On 12/20/07 2:35 PM, "Jeff Eastman" wrote:
> .... Can you give me a pointer on how to accomplish this (upload from other
> machine)?
Re: DFS Block Allocation
Posted by Ted Dunning <td...@veoh.com>.
Just copy the hadoop distro directory to the other machine and use whatever
command you were using before.
A program that uses hadoop just has to have access to all of the nodes
across the net. It doesn't assume anything else.
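Roughly (the hostnames and paths below are placeholders), something like:

    # pull the already-configured hadoop directory off a cluster node
    scp -r cu027:/opt/hadoop /opt/hadoop
    cd /opt/hadoop
    # then upload exactly as before, just from this machine
    bin/hadoop dfs -copyFromLocal /local/bigfile /user/jeff/bigfile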
On 12/20/07 2:35 PM, "Jeff Eastman" <je...@collab.net> wrote:
> .... Can you give me a pointer on how to accomplish this (upload from other
> machine)?
RE: DFS Block Allocation
Posted by Jeff Eastman <je...@collab.net>.
Ted,
I'm still learning, obviously. I was not aware one could upload from any
machine other than the master (which did seem overly restrictive), and
uploading from one outside the cloud would be even better. Can you give
me a pointer on how to accomplish this? Is there a relevant FAQ or
document that I have missed?
My experience with balancing is similar to yours; the upload is uniform,
independent of disk size or availability. I will try rebalancing.
Thanks,
Jeff
-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com]
Sent: Thursday, December 20, 2007 12:02 PM
To: hadoop-user@lucene.apache.org
Subject: Re: DFS Block Allocation
Yes.
I try to always upload data from a machine that is not part of the cluster
for exactly that reason.
I still find that I need to rebalance due to a strange problem in placement.
My datanodes have 10x different sized HDFS disks and I suspect that the
upload is picking datanodes uniformly rather than according to available
space.
Oddly enough, my rebalancing code works well. All it does is iterate
through all files of interest, increasing the replication count for 30
seconds and then decreasing it again (obviously this has to run multiple
threads to manipulate more than 2 files per minute). The replication code
seems to select a home for new blocks more correctly than the original
placement.
On 12/20/07 10:16 AM, "Jeff Eastman" <je...@collab.net> wrote:
> Noting your use of the word "attempts", can I conclude that at some point it
> might be impossible to upload blocks from a local file to the DFS on the same
> node and at that point the blocks would all be loaded elsewhere?
Re: DFS Block Allocation
Posted by Ted Dunning <td...@veoh.com>.
Yes.
I try to always upload data from a machine that is not part of the cluster
for exactly that reason.
I still find that I need to rebalance due to a strange problem in placement.
My datanodes have 10x different sized HDFS disks and I suspect that the
upload is picking datanodes uniformly rather than according to available
space.
Oddly enough, my rebalancing code works well. All it does is iterate
through all files of interest, increasing the replication count for 30
seconds and then decreasing it again (obviously this has to run multiple
threads to manipulate more than 2 files per minute). The replication code
seems to select a home for new blocks more correctly than the original
placement.
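A rough shell sketch of the idea (the paths and replication factors are
just placeholders, not the actual script):

    # raise replication long enough for new replicas to be placed, then drop it back
    for f in /user/data/part-00000 /user/data/part-00001; do
      bin/hadoop dfs -setrep 4 $f
      sleep 30
      bin/hadoop dfs -setrep 3 $f
    done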
On 12/20/07 10:16 AM, "Jeff Eastman" <je...@collab.net> wrote:
> Noting your use of the word "attempts", can I conclude that at some point it
> might be impossible to upload blocks from a local file to the DFS on the same
> node and at that point the blocks would all be loaded elsewhere?
RE: DFS Block Allocation
Posted by Jeff Eastman <je...@collab.net>.
Thanks Dhruba,
That makes sense. The data was already on the master node and I did not
consider that I could upload from other nodes too. The distribution on
the slave nodes is uniform and your response explains why the one other
bigger box did not get a larger number of blocks. Noting your use of the
word "attempts", can I conclude that at some point it might be
impossible to upload blocks from a local file to the DFS on the same
node and at that point the blocks would all be loaded elsewhere?
Jeff
-----Original Message-----
From: dhruba Borthakur [mailto:dhruba@yahoo-inc.com]
Sent: Thursday, December 20, 2007 9:38 AM
To: hadoop-user@lucene.apache.org
Subject: RE: DFS Block Allocation
Hi Jeff,
Did you run the file-upload command on the master node itself? The DFS
client attempts to store one replica of the data on the node on which
the DFSClient is running.
To get a uniform distribution, it would be good if you upload your data
from multiple nodes in your cluster.
Thanks,
dhruba
-----Original Message-----
From: Jeff Eastman [mailto:jeastman@collab.net]
Sent: Thursday, December 20, 2007 7:15 AM
To: hadoop-user@lucene.apache.org
Subject: DFS Block Allocation
I've brought up a small cluster and uploaded some large files. The
master node is cu027 and it seems to be getting an unfair share of the
blocks, especially compared to cu171, which has the same size disk. Can
somebody shed some light on the reasons for this?
Jeff
RE: DFS Block Allocation
Posted by dhruba Borthakur <dh...@yahoo-inc.com>.
Hi Jeff,
Did you run the file-upload command on the master node itself? The DFS
client attempts to store one replica of the data on the node on which
the DFSClient is running.
To get a uniform distribution, it would be good if you upload your data
from multiple nodes in your cluster.
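For example (the file names are made up), split the upload across two of
the datanodes:

    # on cu028:
    bin/hadoop dfs -copyFromLocal /local/chunk-a /user/jeff/chunk-a
    # on cu171:
    bin/hadoop dfs -copyFromLocal /local/chunk-b /user/jeff/chunk-b

Each client keeps one replica locally, so spreading the clients spreads
the first replicas.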
Thanks,
dhruba
-----Original Message-----
From: Jeff Eastman [mailto:jeastman@collab.net]
Sent: Thursday, December 20, 2007 7:15 AM
To: hadoop-user@lucene.apache.org
Subject: DFS Block Allocation
I've brought up a small cluster and uploaded some large files. The
master node is cu027 and it seems to be getting an unfair share of the
blocks, especially compared to cu171, which has the same size disk. Can
somebody shed some light on the reasons for this?
Jeff
Node    Last Contact  Admin State  Size (GB)  Used (%)  Blocks
cu009   0             In Service   69.22      75.07     81
cu027   1             In Service   291.24     59.74     333
cu028   1             In Service   71.19      35.34     76
cu034   1             In Service   71.19      17.79     81
cu035   0             In Service   71.19      18.49     64
cu045   1             In Service   71.19      16.68     70
cu050   2             In Service   71.19      31.99     69
cu062   2             In Service   71.19      39.64     72
cu063   2             In Service   71.19      25.03     77
cu171   2             In Service   291.24     13.05     79
RE: DFS Block Allocation
Posted by Jeff Eastman <je...@collab.net>.
Oooh, too much information. I won't do that again <grin>. Here's the
table in plain text.
Node    Last Contact  State       Size (GB)  Used (%)  Blocks
cu009   0             In Service  69.22      75.07     81
cu027   1             In Service  291.24     59.74     333
cu028   1             In Service  71.19      35.34     76
cu034   1             In Service  71.19      17.79     81
cu035   0             In Service  71.19      18.49     64
cu045   1             In Service  71.19      16.68     70
cu050   2             In Service  71.19      31.99     69
cu062   2             In Service  71.19      39.64     72
cu063   2             In Service  71.19      25.03     77
cu171   2             In Service  291.24     13.05     79
Jeff