Posted to mapreduce-user@hadoop.apache.org by Ognen Duzlevski <og...@nengoiksvelzud.com> on 2014/01/28 20:22:56 UTC

Configuring hadoop 2.2.0

Hello,

I have set up an HDFS cluster by running a name node and a bunch of data
nodes. I ran into a problem where files end up stored only on the node from
which I run the hdfs command, and was told that this is because I do not
have a job tracker and task tracker nodes set up.

However, the documentation for 2.2.0 does not mention any of these (at
least not this page:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
I browsed some of the earlier docs and they do mention job tracker nodes
etc.

So, for 2.2.0 - what is the way to set this up? Do I need a separate
machine to be the "job tracker"? Did this job tracker node change its name
to something else in the current docs?

Thanks,
Ognen
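
In Hadoop 2.x the JobTracker and TaskTrackers of the 1.x docs are replaced by
YARN: a ResourceManager does the scheduling and a NodeManager runs on each
worker node, which is why the 2.2.0 cluster setup page no longer mentions a
job tracker. A rough sketch of the extra configuration this implies - the
paths and the ResourceManager hostname below are illustrative, not taken from
this thread:

# mapred-site.xml: run MapReduce jobs on YARN instead of a JobTracker
cat > /etc/hadoop/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

# yarn-site.xml: point every node at the ResourceManager and enable the
# MapReduce shuffle service on the NodeManagers
cat > /etc/hadoop/yarn-site.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

# start the ResourceManager and NodeManagers (run from the ResourceManager host)
$HADOOP_HOME/sbin/start-yarn.sh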

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
By the way, I discovered the start-balancer.sh script that comes with HDFS
- after running it with -threshold 5, I get the following output in the
logs:

2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[
10.10.0.200:50010, utilization=76.45474474120932]]
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 936.81 GB to
make the cluster balanced.
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB
bytes from 10.10.0.200:50010 to 10.10.0.203:50010
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 10 GB in this
iteration

Maybe this sheds more light on what I am talking about? In any case, why do
I need to run the balancer manually? Or do I?
Ognen
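
HDFS does not rebalance existing blocks on its own, so yes - the balancer is
meant to be run by hand, or on a schedule, whenever the cluster drifts out of
balance. A minimal sketch of both (the sbin path in the cron line is
illustrative):

# one balancing run in the foreground; it keeps iterating until every
# datanode is within 5% of the cluster-average utilization
hdfs balancer -threshold 5

# or schedule the background wrapper script, e.g. nightly at 02:00
0 2 * * * /usr/local/hadoop/sbin/start-balancer.sh -threshold 5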


On Wed, Jan 29, 2014 at 8:05 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you see the unbalanced space - is
>> it on HDFS or on the local disk?
>>
>> 1) HDFS is independent of MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means all
>> HDFS commands and APIs will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp a file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so that the nodes are utilized equally in terms
> of space?
>
>
>> 3) But when you try to copy files into HDFS using distcp, you need the MR
>> component (it doesn't matter whether it is MR1 or MR2), as distcp uses
>> MapReduce to do the massively parallel copying of files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you ran the distcp command, you
>> hadn't started the MR component in your cluster, so distcp in fact copied
>> your files to the LOCAL file system, based on someone else's reply to your
>> original question. I haven't tested this myself, but I am inclined to
>> believe it.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then on the node where you were running the distcp
>> command you should find these files in the local file system, in the path
>> you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
> hdfs://10.10.0.198:54310/test/file
>
> where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
> which is one of the Data nodes, and I am making no mention of the local
> data node storage in this command. My expectation is that the files
> obtained this way from S3 will end up distributed somewhat evenly across
> all of the 16 Data nodes in this HDFS cluster. Am I wrong to expect this?
>
>> 6) After you started yarn/resource manager, you see the imbalance again
>> after you distcp files. Where is this imbalance - on HDFS or in the local
>> file system? List the commands and outputs here, so we can understand your
>> problem more clearly, instead of sometimes being misled by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Also, does anyone know how I can "force" the rebalancer to move more data
in one run? At the current settings, it will take about a week to rebalance
the nodes ;)

Ognen
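
The knob that usually limits how much data the balancer moves per run is the
per-datanode balancing bandwidth, dfs.datanode.balance.bandwidthPerSec, which
defaults to only 1 MB/s. A sketch of raising it (the 50 MB/s value is just an
example):

# raise the balancing bandwidth on all live datanodes for the current run
hdfs dfsadmin -setBalancerBandwidth 52428800

# or set dfs.datanode.balance.bandwidthPerSec to 52428800 in hdfs-site.xml on
# every datanode (takes effect after a datanode restart), then rerun:
hdfs balancer -threshold 5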


On Wed, Jan 29, 2014 at 8:12 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Ahh, OK :)
>
> However, this seems kind of silly - it may be stored in the datanode but I
> find the need to "force" the balancing manually somewhat strange. I mean
> why use hdfs://namenode:port/path/file if the copies end up being stored
> locally anyway? ;)
>
> Ognen
>
>
> On Wed, Jan 29, 2014 at 8:10 AM, Selçuk Şenkul <ss...@gmail.com> wrote:
>
>> Try to run the command from the namenode, or another node which is not a
>> datanode; the files should then distribute. As far as I know, if you copy a
>> file to HDFS from a datanode, the first copy is stored on that datanode.
>>
>> On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski <
>> ognen@nengoiksvelzud.com> wrote:
>>
>>> Hello (and thanks for replying!) :)
>>>
>>> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>>>
>>>> Hi, Ognen:
>>>>
>>>> I noticed you were asking this question before under a different
>>>> subject line. I think you need to tell us where you see the unbalanced
>>>> space - is it on HDFS or on the local disk?
>>>>
>>>> 1) HDFS is independent of MR. They are not related to each other.
>>>>
>>>
>>> OK good to know.
>>>
>>>
>>>> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means
>>>> all HDFS commands and APIs will just work.
>>>>
>>>
>>> Good to know. Does this also mean that when I put or distcp a file to
>>> hdfs://namenode:54310/path/file - it will "decide" how to split the file
>>> across all the datanodes so that the nodes are utilized equally in terms
>>> of space?
>>>
>>>
>>>> 3) But when you try to copy files into HDFS using distcp, you need the MR
>>>> component (it doesn't matter whether it is MR1 or MR2), as distcp uses
>>>> MapReduce to do the massively parallel copying of files.
>>>>
>>>
>>> Understood.
>>>
>>>
>>>> 4) Your original problem is that when you ran the distcp command, you
>>>> hadn't started the MR component in your cluster, so distcp in fact copied
>>>> your files to the LOCAL file system, based on someone else's reply to your
>>>> original question. I haven't tested this myself, but I am inclined to
>>>> believe it.
>>>>
>>>
>>> Sure. But even if distcp is running in one thread, its destination is
>>> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
>>> files across the whole HDFS cluster? Or am I delusional? :)
>>>
>>>
>>>> 5) If the above is true, then on the node where you were running the
>>>> distcp command you should find these files in the local file system, in
>>>> the path you specified. You should check and verify that.
>>>>
>>>
>>> OK - so the command is this:
>>>
>>> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
>>> hdfs://10.10.0.198:54310/test/file
>>>
>>> where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
>>> which is one of the Data nodes, and I am making no mention of the local
>>> data node storage in this command. My expectation is that the files
>>> obtained this way from S3 will end up distributed somewhat evenly across
>>> all of the 16 Data nodes in this HDFS cluster. Am I wrong to expect this?
>>>
>>>> 6) After you started yarn/resource manager, you see the imbalance again
>>>> after you distcp files. Where is this imbalance - on HDFS or in the local
>>>> file system? List the commands and outputs here, so we can understand your
>>>> problem more clearly, instead of sometimes being misled by your words.
>>>>
>>>
>>> The imbalance is as follows: the machine I run the distcp command on
>>> (one of the Data nodes) ends up with 70+% of the space it is contributing
>>> to the HDFS cluster occupied with these files while the rest of the data
>>> nodes in the cluster only get 10% of their contributed space occupied.
>>> Since HDFS is a distributed, parallel file system I would expect that the
>>> file space occupied would be spread evenly or somewhat evenly across all
>>> the data nodes.
>>>
>>> Thanks!
>>> Ognen
>>>
>>
>>
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Ahh, OK :)

However, this seems kind of silly - it may be stored in the datanode but I
find the need to "force" the balancing manually somewhat strange. I mean
why use hdfs://namenode:port/path/file if the copies end up being stored
locally anyway? ;)

Ognen
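
What is described above is HDFS's default block placement: when the client
writing a file is itself a datanode, the first replica of every block is
written to that local datanode, and only the remaining replicas go elsewhere.
One way to confirm where the replicas of a file actually landed (the path is
the one used earlier in the thread):

# list each block of the file and the datanodes holding its replicas
hdfs fsck /test/file -files -blocks -locations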


On Wed, Jan 29, 2014 at 8:10 AM, Selçuk Şenkul <ss...@gmail.com> wrote:

> Try to run the command from the namenode, or another node which is not a
> datanode; the files should then distribute. As far as I know, if you copy a
> file to HDFS from a datanode, the first copy is stored on that datanode.
>
> On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski <ognen@nengoiksvelzud.com
> > wrote:
>
>> Hello (and thanks for replying!) :)
>>
>> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>>
>>> Hi, Ognen:
>>>
>>> I noticed you were asking this question before under a different subject
>>> line. I think you need to tell us where you see the unbalanced space - is
>>> it on HDFS or on the local disk?
>>>
>>> 1) HDFS is independent of MR. They are not related to each other.
>>>
>>
>> OK good to know.
>>
>>
>>> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means
>>> all HDFS commands and APIs will just work.
>>>
>>
>> Good to know. Does this also mean that when I put or distcp a file to
>> hdfs://namenode:54310/path/file - it will "decide" how to split the file
>> across all the datanodes so that the nodes are utilized equally in terms
>> of space?
>>
>>
>>> 3) But when you try to copy files into HDFS using distcp, you need the MR
>>> component (it doesn't matter whether it is MR1 or MR2), as distcp uses
>>> MapReduce to do the massively parallel copying of files.
>>>
>>
>> Understood.
>>
>>
>>> 4) Your original problem is that when you ran the distcp command, you
>>> hadn't started the MR component in your cluster, so distcp in fact copied
>>> your files to the LOCAL file system, based on someone else's reply to your
>>> original question. I haven't tested this myself, but I am inclined to
>>> believe it.
>>>
>>
>> Sure. But even if distcp is running in one thread, its destination is
>> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
>> files across the whole HDFS cluster? Or am I delusional? :)
>>
>>
>>> 5) If the above is true, then on the node where you were running the
>>> distcp command you should find these files in the local file system, in
>>> the path you specified. You should check and verify that.
>>>
>>
>> OK - so the command is this:
>>
>> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
>> hdfs://10.10.0.198:54310/test/file
>>
>> where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
>> which is one of the Data nodes, and I am making no mention of the local
>> data node storage in this command. My expectation is that the files
>> obtained this way from S3 will end up distributed somewhat evenly across
>> all of the 16 Data nodes in this HDFS cluster. Am I wrong to expect this?
>>
>>> 6) After you started yarn/resource manager, you see the imbalance again
>>> after you distcp files. Where is this imbalance - on HDFS or in the local
>>> file system? List the commands and outputs here, so we can understand your
>>> problem more clearly, instead of sometimes being misled by your words.
>>>
>>
>> The imbalance is as follows: the machine I run the distcp command on (one
>> of the Data nodes) ends up with 70+% of the space it is contributing to the
>> HDFS cluster occupied with these files while the rest of the data nodes in
>> the cluster only get 10% of their contributed space occupied. Since HDFS is
>> a distributed, parallel file system I would expect that the file space
>> occupied would be spread evenly or somewhat evenly across all the data
>> nodes.
>>
>> Thanks!
>> Ognen
>>
>
>

Re: Configuring hadoop 2.2.0

Posted by Selçuk Şenkul <ss...@gmail.com>.
Try to run the command from the namenode, or another node which is not a
datanode; the files should then distribute. As far as I know, if you copy a
file to HDFS from a datanode, the first copy is stored on that datanode.
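
A sketch of the same copy driven from a non-datanode client such as the
namenode (source and destination are the ones in the quoted command below;
YARN has to be running, since distcp executes as a MapReduce job):

# started from a host that is not a datanode, so no single datanode is
# preferred for the first replica of every block
hadoop --config /etc/hadoop distcp \
  s3n://<credentials>@bucket/file \
  hdfs://10.10.0.198:54310/test/file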

On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you see the unbalanced space - is
>> it on HDFS or on the local disk?
>>
>> 1) HDFS is independent of MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (YARN), HDFS should work by itself, which means all
>> HDFS commands and APIs will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp a file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so that the nodes are utilized equally in terms
> of space?
>
>
>> 3) But when you try to copy files into HDFS using distcp, you need the MR
>> component (it doesn't matter whether it is MR1 or MR2), as distcp uses
>> MapReduce to do the massively parallel copying of files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you ran the distcp command, you
>> hadn't started the MR component in your cluster, so distcp in fact copied
>> your files to the LOCAL file system, based on someone else's reply to your
>> original question. I haven't tested this myself, but I am inclined to
>> believe it.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then on the node where you were running the distcp
>> command you should find these files in the local file system, in the path
>> you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file
> hdfs://10.10.0.198:54310/test/file
>
> where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
> which is one of the Data nodes, and I am making no mention of the local
> data node storage in this command. My expectation is that the files
> obtained this way from S3 will end up distributed somewhat evenly across
> all of the 16 Data nodes in this HDFS cluster. Am I wrong to expect this?
>
>> 6) After you started yarn/resource manager, you see the imbalance again
>> after you distcp files. Where is this imbalance - on HDFS or in the local
>> file system? List the commands and outputs here, so we can understand your
>> problem more clearly, instead of sometimes being misled by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>
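
For checking how the used space is actually spread across the datanodes
before and after a balancer run, the per-node report shows the kind of 70%
vs 10% skew described above at a glance:

# prints configured capacity, DFS used and DFS used% for every datanode
hdfs dfsadmin -report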

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
By the way, I discovered the start-balancer.sh script that comes with HDFS
- after running it with -threshold 5, I get the following output in the
logs:

2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[
10.10.0.200:50010, utilization=76.45474474120932]]
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 936.81 GB to
make the cluster balanced.
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB
bytes from 10.10.0.200:50010 to 10.10.0.203:50010
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 10 GB in this
iteration

Maybe this sheds more light on what I am talking about? In any case, why do
I need to run the balancer manually? Or do I?
Ognen


On Wed, Jan 29, 2014 at 8:05 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDSF cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
By the way, I discovered the start-balancer.sh script that comes with HDFS
- after running it with -threshold 5, I get the following output in the
logs:

2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[
10.10.0.200:50010, utilization=76.45474474120932]]
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 936.81 GB to
make the cluster balanced.
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB
bytes from 10.10.0.200:50010 to 10.10.0.203:50010
2014-01-29 14:04:16,503 INFO
org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 10 GB in this
iteration

Maybe this sheds more light on what I am talking about? In any case, why do
I need to run the balancer manually? Or do I?
Ognen


On Wed, Jan 29, 2014 at 8:05 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDSF cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Selçuk Şenkul <ss...@gmail.com>.
Try to run the command from the namenode, or another node which is not a
datanode, the files should distribute. As far as I know, if you copy a file
to hdfs from a datanode, the first copy is stored in that datanode.

On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDSF cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Selçuk Şenkul <ss...@gmail.com>.
Try to run the command from the namenode, or another node which is not a
datanode, the files should distribute. As far as I know, if you copy a file
to hdfs from a datanode, the first copy is stored in that datanode.

On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDSF cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
By the way, I discovered the start-balancer.sh script that comes with HDFS
- after running it with -threshold 5, I get the following output in the
logs:

2014-01-29 14:04:16,503 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over-utilized: [Source[10.10.0.200:50010, utilization=76.45474474120932]]
2014-01-29 14:04:16,503 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 0 underutilized: []
2014-01-29 14:04:16,503 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 936.81 GB to make the cluster balanced.
2014-01-29 14:04:16,503 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Decided to move 10 GB bytes from 10.10.0.200:50010 to 10.10.0.203:50010
2014-01-29 14:04:16,503 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Will move 10 GB in this iteration

Maybe this sheds more light on what I am talking about? In any case, why do
I need to run the balancer manually? Or do I?
Ognen
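
For anyone reproducing this, a minimal sketch of driving the balancer by hand, assuming the stock 2.2.0 sbin scripts (the -threshold argument is the allowed deviation, in percent, of each datanode's usage from the cluster-wide average):

  # run the balancer in the background via the helper script
  $HADOOP_HOME/sbin/start-balancer.sh -threshold 5

  # or run it in the foreground and watch each iteration
  hdfs balancer -threshold 5

  # stop a background balancer
  $HADOOP_HOME/sbin/stop-balancer.sh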


On Wed, Jan 29, 2014 at 8:05 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDFS cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Selçuk Şenkul <ss...@gmail.com>.
Try running the command from the namenode, or from another node that is not
a datanode - the files should then distribute across the cluster. As far as
I know, if you copy a file to HDFS from a datanode, the first replica of
each block is stored on that datanode.
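
A minimal sketch of that suggestion, with the credentials, bucket and file names as placeholders, run from the namenode (or any other box that is not a datanode):

  # running from a non-datanode avoids pinning the first replica of every block to the local node
  for f in file1 file2 file3; do
    hadoop --config /etc/hadoop distcp \
      s3n://ACCESS_KEY:SECRET_KEY@bucket/$f \
      hdfs://10.10.0.198:54310/test/$f
  done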

On Wed, Jan 29, 2014 at 4:05 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello (and thanks for replying!) :)
>
> On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:
>
>> Hi, Ognen:
>>
>> I noticed you were asking this question before under a different subject
>> line. I think you need to tell us where you mean unbalance space, is it on
>> HDFS or the local disk.
>>
>> 1) The HDFS is independent as MR. They are not related to each other.
>>
>
> OK good to know.
>
>
>> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
>> HDFS command, API will just work.
>>
>
> Good to know. Does this also mean that when I put or distcp file to
> hdfs://namenode:54310/path/file - it will "decide" how to split the file
> across all the datanodes so as the nodes are utilized equally in terms of
> space?
>
>
>> 3) But when you tried to copy file into HDFS using distcp, you need MR
>> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
>> MapReduce to do the massively parallel copying files.
>>
>
> Understood.
>
>
>> 4) Your original problem is that when you run the distcp command, you
>> didn't start the MR component in your cluster, so distcp in fact copy your
>> files to the LOCAL file system, based on some one else's reply to your
>> original question. I didn't test this myself before, but I kind of believe
>> that.
>>
>
> Sure. But even if distcp is running in one thread, its destination is
> hdfs://namenode:54310/path/file - should this not ensure equal "split" of
> files across the whole HDFS cluster? Or am I delusional? :)
>
>
>> 5) If the above is true, then you should see under node your were running
>> distcp command there should be having these files in the local file system,
>> in the path you specified. You should check and verify that.
>>
>
> OK - so the command is this:
>
> hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://
> 10.10.0.198:54310/test/file where 10.10.0.198 is the HDFS Name node. I am
> running this on 10.10.0.200 which is one of the Data nodes and I am making
> no mention of the local data node storage in this command. My expectation
> is that the files obtained this way from S3 will end up distributed
> somewhat evenly across all of the 16 Data nodes in this HDFS cluster. Am I
> wrong to expect this?
>
> 6) After you start yarn/resource manager, you see the unbalance after you
>> distcp files again. Where is this unbalance? In the HDFS or local file
>> system. List the commands  and outputs here, so we can understand your
>> problem more clearly, instead of misleading sometimes by your words.
>>
>
> The imbalance is as follows: the machine I run the distcp command on (one
> of the Data nodes) ends up with 70+% of the space it is contributing to the
> HDFS cluster occupied with these files while the rest of the data nodes in
> the cluster only get 10% of their contributed space occupied. Since HDFS is
> a distributed, parallel file system I would expect that the file space
> occupied would be spread evenly or somewhat evenly across all the data
> nodes.
>
> Thanks!
> Ognen
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Hello (and thanks for replying!) :)

On Wed, Jan 29, 2014 at 7:38 AM, java8964 <ja...@hotmail.com> wrote:

> Hi, Ognen:
>
> I noticed you were asking this question before under a different subject
> line. I think you need to tell us where you mean unbalance space, is it on
> HDFS or the local disk.
>
> 1) The HDFS is independent as MR. They are not related to each other.
>

OK good to know.


> 2) Without MR1 or MR2 (Yarn), HDFS should work as itself, which means all
> HDFS command, API will just work.
>

Good to know. Does this also mean that when I put or distcp a file to
hdfs://namenode:54310/path/file it will "decide" how to split the file
across all the datanodes so that the nodes are utilized equally in terms of
space?


> 3) But when you tried to copy file into HDFS using distcp, you need MR
> component (Doesn't matter it is MR1 or MR2), as distcp indeed uses
> MapReduce to do the massively parallel copying files.
>

Understood.


> 4) Your original problem is that when you run the distcp command, you
> didn't start the MR component in your cluster, so distcp in fact copy your
> files to the LOCAL file system, based on some one else's reply to your
> original question. I didn't test this myself before, but I kind of believe
> that.
>

Sure. But even if distcp is running in one thread, its destination is
hdfs://namenode:54310/path/file - should this not ensure equal "split" of
files across the whole HDFS cluster? Or am I delusional? :)


> 5) If the above is true, then you should see under node your were running
> distcp command there should be having these files in the local file system,
> in the path you specified. You should check and verify that.
>

OK - so the command is this:

hadoop --config /etc/hadoop distcp s3n://<credentials>@bucket/file hdfs://10.10.0.198:54310/test/file

where 10.10.0.198 is the HDFS Name node. I am running this on 10.10.0.200,
which is one of the Data nodes, and I make no mention of the local data node
storage in this command. My expectation is that the files obtained this way
from S3 will end up distributed somewhat evenly across all of the 16 Data
nodes in this HDFS cluster. Am I wrong to expect this?

6) After you start yarn/resource manager, you see the unbalance after you
> distcp files again. Where is this unbalance? In the HDFS or local file
> system. List the commands  and outputs here, so we can understand your
> problem more clearly, instead of misleading sometimes by your words.
>

The imbalance is as follows: the machine I run the distcp command on (one
of the Data nodes) ends up with 70+% of the space it is contributing to the
HDFS cluster occupied with these files while the rest of the data nodes in
the cluster only get 10% of their contributed space occupied. Since HDFS is
a distributed, parallel file system I would expect that the file space
occupied would be spread evenly or somewhat evenly across all the data
nodes.

Thanks!
Ognen

RE: Configuring hadoop 2.2.0

Posted by java8964 <ja...@hotmail.com>.
Hi, Ognen:
I noticed you were asking this question before under a different subject line. I think you need to tell us where you see the unbalanced space - is it on HDFS or on the local disk?

1) HDFS is independent of MR. They are not related to each other.
2) Without MR1 or MR2 (Yarn), HDFS should work by itself, which means all HDFS commands and APIs will just work.
3) But when you try to copy files into HDFS using distcp, you need the MR component (doesn't matter whether it is MR1 or MR2), as distcp uses MapReduce to do the massively parallel copying of files.
4) Your original problem is that when you ran the distcp command, you had not started the MR component in your cluster, so distcp in fact copied your files to the LOCAL file system, based on someone else's reply to your original question. I didn't test this myself, but I am inclined to believe it.
5) If the above is true, then on the node where you were running the distcp command you should find these files in the local file system, in the path you specified. You should check and verify that.
6) After you started yarn/resource manager, you saw the imbalance after you ran distcp again. Where is this imbalance - in HDFS or in the local file system? List the commands and their outputs here, so we can understand your problem clearly instead of being misled by a description in words.
7) My suggestion is that after you start the yarn/resource managers, you run some of the example MR jobs that come with hadoop to make sure your cluster is working normally, and then try your distcp command.
Thanks
Yong
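
For point 7, a minimal smoke-test sketch, assuming the stock 2.2.0 tarball layout (the examples jar path below is the default one and may differ in other packagings):

  # start the YARN daemons (resourcemanager on this node, nodemanagers on the slaves)
  $HADOOP_HOME/sbin/start-yarn.sh

  # run one of the bundled example jobs to confirm work is scheduled across the nodes
  hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 100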

Date: Wed, 29 Jan 2014 06:38:54 -0600
Subject: Re: Configuring hadoop 2.2.0
From: ognen@nengoiksvelzud.com
To: user@hadoop.apache.org

So, the question is: do I or don't I need to run the yarn/resource manager/node manager combination in addition to HDFS? My impression was what you are saying - that HDFS is independent of the MR component.


Thanks! :)
Ognen


On Wed, Jan 29, 2014 at 6:37 AM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:

Harsh,

Thanks for your reply. What happens is this: I have about 70 files, all about 20GB in size in an Amazon S3 bucket. I got them from the bucket in a for loop, file by file, using the distcp command from a single node.

When I look at the distribution of space consumed on the HDFS cluster now, the node I ran the command on has 70% of its space taken up while the rest of the nodes are at 10% local space usage. All of the nodes started out with the same local space of 1.6TB mounted in the same exact partition /extra (ephemeral space on an Amazon instance put into a RAID0 array).

Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with -threshold 5. It has been running since yesterday, maybe the 5% balancing threshold is too much?

Ognen

On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <ha...@cloudera.com> wrote:

I don't believe what you've been told is correct (IIUC). HDFS is an
independent component and does not require presence of YARN (or MR) to
function correctly.

What do you exactly mean when you say "files are only stored on the
node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
local FS / result list or does it show a true HDFS directory listing?
Your problem may simply be configuring clients right - depending on
this.

On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:

> Hello,
>
> I have set up an HDFS cluster by running a name node and a bunch of data
> nodes. I ran into a problem where the files are only stored on the node that
> uses the hdfs command and was told that this is because I do not have a job
> tracker and task nodes set up.
>
> However, the documentation for 2.2.0 does not mention any of these (at least
> not this page:
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> I browsed some of the earlier docs and they do mention job tracker nodes
> etc.
>
> So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> to be the "job tracker"? Did this job tracker node change its name to
> something else in the current docs?
>
> Thanks,
> Ognen

--
Harsh J

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
So, the question is: do I or don't I need to run the yarn/resource
manager/node manager combination in addition to HDFS? My impression matches
what you are saying - that HDFS is independent of the MR component.

Thanks! :)
Ognen


On Wed, Jan 29, 2014 at 6:37 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Harsh,
>
> Thanks for your reply. What happens is this: I have about 70 files, all
> about 20GB in size in an Amazon S3 bucket. I got them from the bucket in a
> for loop, file by file using the -distcp command from a single node.
>
> When I look at the distribution of space consumed on the HDFS cluster now,
> the node I ran the command on has 70% of its space taken up while the rest
> of the nodes are at 10% local space usage. All of the nodes started out
> with the same local space of 1.6TB mounted in the same exact partition
> /extra (ephemeral space on an Amazon instance put into a RAID0 array).
>
> Hence, the distribution of space is not balanced.
>
> However, I did discover the start-balancer.sh script and ran it with
> -threshold 5. It has been running since yesterday, maybe the 5% balancing
> threshold is too much?
>
> Ognen
>
>
>
>
> On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> I don't believe what you've been told is correct (IIUC). HDFS is an
>> independent component and does not require presence of YARN (or MR) to
>> function correctly.
>>
>> What do you exactly mean when you say "files are only stored on the
>> node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
>> local FS / result list or does it show a true HDFS directory listing?
>> Your problem may simply be configuring clients right - depending on
>> this.
>>
>> On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
>> <og...@nengoiksvelzud.com> wrote:
>> > Hello,
>> >
>> > I have set up an HDFS cluster by running a name node and a bunch of data
>> > nodes. I ran into a problem where the files are only stored on the node
>> that
>> > uses the hdfs command and was told that this is because I do not have a
>> job
>> > tracker and task nodes set up.
>> >
>> > However, the documentation for 2.2.0 does not mention any of these (at
>> least
>> > not this page:
>> >
>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
>> ).
>> > I browsed some of the earlier docs and they do mention job tracker nodes
>> > etc.
>> >
>> > So, for 2.2.0 - what is the way to set this up? Do I need a separate
>> machine
>> > to be the "job tracker"? Did this job tracker node change its name to
>> > something else in the current docs?
>> >
>> > Thanks,
>> > Ognen
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Harsh,

Thanks for your reply. What happens is this: I have about 70 files, all
about 20GB in size in an Amazon S3 bucket. I got them from the bucket in a
for loop, file by file, using the distcp command from a single node.

When I look at the distribution of space consumed on the HDFS cluster now,
the node I ran the command on has 70% of its space taken up while the rest
of the nodes are at 10% local space usage. All of the nodes started out
with the same local space of 1.6TB mounted in the same exact partition
/extra (ephemeral space on an Amazon instance put into a RAID0 array).

Hence, the distribution of space is not balanced.

However, I did discover the start-balancer.sh script and ran it with
-threshold 5. It has been running since yesterday, maybe the 5% balancing
threshold is too much?

Ognen




On Wed, Jan 29, 2014 at 4:08 AM, Harsh J <ha...@cloudera.com> wrote:

> I don't believe what you've been told is correct (IIUC). HDFS is an
> independent component and does not require presence of YARN (or MR) to
> function correctly.
>
> What do you exactly mean when you say "files are only stored on the
> node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
> local FS / result list or does it show a true HDFS directory listing?
> Your problem may simply be configuring clients right - depending on
> this.
>
> On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
> <og...@nengoiksvelzud.com> wrote:
> > Hello,
> >
> > I have set up an HDFS cluster by running a name node and a bunch of data
> > nodes. I ran into a problem where the files are only stored on the node
> that
> > uses the hdfs command and was told that this is because I do not have a
> job
> > tracker and task nodes set up.
> >
> > However, the documentation for 2.2.0 does not mention any of these (at
> least
> > not this page:
> >
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html
> ).
> > I browsed some of the earlier docs and they do mention job tracker nodes
> > etc.
> >
> > So, for 2.2.0 - what is the way to set this up? Do I need a separate
> machine
> > to be the "job tracker"? Did this job tracker node change its name to
> > something else in the current docs?
> >
> > Thanks,
> > Ognen
>
>
>
> --
> Harsh J
>

Re: Configuring hadoop 2.2.0

Posted by Harsh J <ha...@cloudera.com>.
I don't believe what you've been told is correct (IIUC). HDFS is an
independent component and does not require presence of YARN (or MR) to
function correctly.

What do you exactly mean when you say "files are only stored on the
node that uses the hdfs command"? Does your "hdfs dfs -ls /" show a
local FS / result list or does it show a true HDFS directory listing?
Your problem may simply be configuring clients right - depending on
this.
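
For example, a quick check along these lines (assuming the client
configuration lives under $HADOOP_CONF_DIR; the namenode host and port
below are placeholders):

  # which filesystem does the client default to? (fs.defaultFS, or the older
  # fs.default.name) - if this is file:/// or missing, writes go to local disk
  grep -A 1 -E 'fs.defaultFS|fs.default.name' "$HADOOP_CONF_DIR"/core-site.xml

  # these two listings should differ on a correctly configured HDFS client
  hdfs dfs -ls /
  hdfs dfs -ls file:///

  # addressing the namenode explicitly bypasses a bad default
  hdfs dfs -ls hdfs://namenode:54310/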

On Wed, Jan 29, 2014 at 12:52 AM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:
> Hello,
>
> I have set up an HDFS cluster by running a name node and a bunch of data
> nodes. I ran into a problem where the files are only stored on the node that
> uses the hdfs command and was told that this is because I do not have a job
> tracker and task nodes set up.
>
> However, the documentation for 2.2.0 does not mention any of these (at least
> not this page:
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> I browsed some of the earlier docs and they do mention job tracker nodes
> etc.
>
> So, for 2.2.0 - what is the way to set this up? Do I need a separate machine
> to be the "job tracker"? Did this job tracker node change its name to
> something else in the current docs?
>
> Thanks,
> Ognen



-- 
Harsh J

Re: Configuring hadoop 2.2.0

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
Furthermore, what is the difference between a ResourceManager node and a
NodeManager node?
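
(My guess at the mapping, going by the 2.2.0 sbin scripts, is roughly one
ResourceManager per cluster taking over the scheduling half of the old
jobtracker, and one NodeManager per worker node taking over the tasktracker
role; please correct me if that is wrong:)

  # on the designated master node (resource arbitration and scheduling)
  $HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager

  # on every worker node, alongside the datanode (runs the actual containers)
  $HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
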
Ognen


On Tue, Jan 28, 2014 at 1:22 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello,
>
> I have set up an HDFS cluster by running a name node and a bunch of data
> nodes. I ran into a problem where the files are only stored on the node
> that uses the hdfs command and was told that this is because I do not have
> a job tracker and task nodes set up.
>
> However, the documentation for 2.2.0 does not mention any of these (at
> least not this page:
> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html).
> I browsed some of the earlier docs and they do mention job tracker nodes
> etc.
>
> So, for 2.2.0 - what is the way to set this up? Do I need a separate
> machine to be the "job tracker"? Did this job tracker node change its name
> to something else in the current docs?
>
> Thanks,
> Ognen
>
