Posted to user@hadoop.apache.org by Mayank <ma...@gmail.com> on 2013/06/10 11:36:09 UTC

Application errors with one disk on datanode getting filled up to 100%

We are running a Hadoop cluster with 10 datanodes and a namenode. Each
datanode is set up with 4 disks (/data1, /data2, /data3, /data4), each disk
having a capacity of 414 GB.


hdfs-site.xml has the following property set:

<property>
        <name>dfs.data.dir</name>

<value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
        <description>Data dirs for DFS.</description>
</property>

Now we are facing an issue wherein /data1 fills up quickly, and we often
see its usage running at 100% with just a few megabytes of free space.
This issue is visible on 7 out of 10 datanodes at present.

We have some Java applications writing to HDFS, and we often see the
following errors in our application logs:

java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)


I went through some old discussions, and it looks like manual rebalancing
is what is required in this case; we should also have
dfs.datanode.du.reserved set up.
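For reference, a sketch of that reservation in hdfs-site.xml (the 30 GB value below is only an example for illustration, not a recommendation):

```xml
<property>
        <name>dfs.datanode.du.reserved</name>
        <!-- bytes reserved per volume for non-DFS use; 32212254720 bytes = 30 GB (example value) -->
        <value>32212254720</value>
        <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```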

However, I'd like to understand whether one disk getting filled up to 100%
can result in the errors we are seeing in our application.

Also, are there any other performance implications when some of the disks
on a datanode are running at 100% usage?
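For illustration, here is a simplified, hypothetical sketch (invented names, not Hadoop's actual volume-choosing code) of why round-robin volume selection can fill one disk first: it evens out block counts across volumes, not free space, so a volume that also carries non-DFS data (logs, tmp files) starts fuller and reaches 100% first. The real policy also checks that a block fits before placing it; this sketch deliberately omits that.

```java
// Hypothetical, simplified model of round-robin block placement across
// a datanode's volumes. Not Hadoop's implementation.
public class RoundRobinVolumes {

    // Returns the index of the first volume to reach its capacity while
    // placing `blocks` blocks of `blockGb` each, or -1 if everything fits.
    static int firstFullVolume(long[] capacityGb, long[] usedGb, long blockGb, int blocks) {
        int v = 0;
        for (int i = 0; i < blocks; i++) {
            usedGb[v] += blockGb;                // place the next block on volume v
            if (usedGb[v] >= capacityGb[v]) {
                return v;                        // volume v just hit 100%
            }
            v = (v + 1) % capacityGb.length;     // strict round robin, no free-space check
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] cap  = {414, 414, 414, 414};      // four 414 GB disks, as in this thread
        long[] used = {120,  40,  40,  40};      // /data1 also holds non-DFS data
        int full = firstFullVolume(cap, used, 1, 2000);
        System.out.println("first volume to hit 100%: /data" + (full + 1));
        // prints: first volume to hit 100%: /data1
    }
}
```

Even though every volume receives the same number of blocks, the one with extra non-DFS usage runs out first, which matches the behavior described above.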
-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is the tomorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
I wasn't aware of the datanode-level balancing procedure; I was thinking
about the HDFS balancer:
http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F

Thanks,
Rahul



On Fri, Jun 14, 2013 at 5:50 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Sandeep,
>
> I was thinking that the overall HDFS cluster might get unbalanced over
> time, and the balancer might be useful in that case.
> I was more interested to know why only one disk out of the 4 configured
> disks on the DN is getting all the writes. From what I have read, writes
> should happen in round-robin fashion, which should ideally lead to all the
> configured disks in the DN being similarly loaded.
>
> Not sure how the balancer is fixing this issue.
>
> Rgds,
> Rahul
>
>
>
> On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:
>
>> Rahul,
>>
>> In general, this issue sometimes happens in Hadoop. There is no single
>> known cause for it.
>> To mitigate it, you need to run the balancer at regular intervals.
>>
>> Thanks,
>> Sandeep.
>>
>> ------------------------------
>> Date: Fri, 14 Jun 2013 16:39:02 +0530
>> Subject: Re: Application errors with one disk on datanode getting filled
>> up to 100%
>> From: mail2mayank@gmail.com
>> To: user@hadoop.apache.org
>>
>>
>> No, as of this moment we have no idea about the reasons for that behavior.
>>
>>
>> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>> Thanks Mayank. Any clue why only one disk was getting all the writes?
>>
>> Rahul
>>
>>
>> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>>
>> So we did a manual rebalance (followed instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
>> and also reserved 30 GB of space for non-DFS usage via
>> dfs.datanode.du.reserved and restarted our apps.
>>
>> Things have been going fine till now.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>> I have a few points to make; these may not be very helpful for the said
>> problem.
>>
>> + The "All datanodes are bad" exception does not typically point to a
>> problem related to disk space being full.
>> + hadoop.tmp.dir acts as the base location for other Hadoop-related
>> properties; not sure if any particular directory is created specifically.
>> + Only one disk getting filled looks strange. The other disks are included
>> when formatting the NN.
>>
>> It would be interesting to know the reason for this.
>> Please keep us posted.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>> From the snapshot, you have around 3 TB left for writing data.
>>
>> Can you check each individual datanode's storage health?
>> As you said, you have 80 servers writing to HDFS in parallel; I am not
>> sure whether that could be an issue.
>> As suggested in past threads, you can do a rebalance of the blocks, but
>> that will take some time to finish and will not solve your issue right
>> away.
>>
>> You can wait for others to reply. I am sure there will be far better
>> solutions from experts for this.
>>
>>
>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>
>> No, it's not a MapReduce job. We have a Java app running on around 80
>> machines which writes to HDFS. The error that I mentioned is being thrown
>> by the application. Yes, we have the replication factor set to 3, and the
>> following is the status of HDFS:
>>
>> Configured Capacity: 16.15 TB
>> DFS Used: 11.84 TB
>> Non DFS Used: 872.66 GB
>> DFS Remaining: 3.46 TB
>> DFS Used%: 73.3 %
>> DFS Remaining%: 21.42 %
>> Live Nodes: 10
>> Dead Nodes: 0
>> Decommissioning Nodes: 0
>> Number of Under-Replicated Blocks: 0
>>
>>
>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>> When you say the application errors out, does that mean your MapReduce
>> job is erroring? In that case, apart from HDFS space, you will need to
>> look at the mapred tmp directory space as well.
>>
>> You have 400 GB * 4 * 10 = 16 TB of disk; let's assume you have a
>> replication factor of 3, so at most you can hold a data size of about
>> 5 TB. I am also assuming you are not scheduling your program to run on
>> the entire 5 TB with just 10 nodes.
>>
>> I suspect your cluster's mapred tmp space is getting filled while the
>> job is running.


Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandeep.
Yes, that's correct; I was more interested in the uneven
distribution within the DN.

Thanks,
Rahul
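
One quick way to see the kind of per-volume skew Rahul is asking about is to compare usage across a datanode's data directories. A minimal shell sketch — the /dataN mount points are assumed from the dfs.data.dir value in this thread, and `flag_full` is a hypothetical helper name:

```shell
# Print any volume whose use% meets or exceeds a threshold.
# Feed it "<mount> <use%>" lines, one per data volume.
flag_full() {
  threshold="$1"
  while read -r mount pct; do
    pct="${pct%\%}"                       # strip the trailing % sign
    if [ "$pct" -ge "$threshold" ]; then
      echo "$mount"
    fi
  done
}

# Sample df-style input resembling the skew described in the thread:
printf '/data1 100%%\n/data2 61%%\n/data3 58%%\n/data4 60%%\n' | flag_full 90
```

On a live node the input lines could come from something like `df --output=target,pcent /data1 /data2 /data3 /data4 | tail -n +2` (the `--output` option is GNU df; adjust for other platforms).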


On Fri, Jun 14, 2013 at 6:12 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general, Hadoop tries to compute data locally; that is, if you run a
> MapReduce task on a particular input, it will try to compute and write the
> data locally (this happens the majority of the time) and replicate to
> other nodes.
>
> In your scenario, the majority of your input data may be on a single
> datanode, so Hadoop is trying to write the output data to the same datanode.
>
> Thanks,
> Sandeep.
> ------------------------------
> From: rahul.rec.dgp@gmail.com
> Date: Fri, 14 Jun 2013 17:50:46 +0530
>
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> To: user@hadoop.apache.org
>
>
> Thanks Sandeep,
>
> I was thinking that the overall HDFS cluster might get unbalanced over
> time, and the balancer might be useful in that case.
> I was more interested to know why only one disk out of the 4 configured
> disks of the DN is getting all the writes. From what I have read, writes
> should be in round-robin fashion, which should ideally lead to all the
> configured disks in the DN being similarly loaded.
>
> Not sure how the balancer is fixing this issue.
>
> Rgds,
> Rahul
>
>
>
> On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:
>
> Rahul,
>
> In general this issue happens sometimes in Hadoop. There is no exact
> reason for it.
> To mitigate it, you need to run the balancer at regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> From: mail2mayank@gmail.com
> To: user@hadoop.apache.org
>
>
> No, as of this moment we have no idea about the reasons for that behavior.
>
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> Thanks Mayank. Any clue why only one disk was getting all the writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
> So we did a manual rebalance (following the instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non-DFS usage via
> dfs.datanode.du.reserved, then restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
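
As an aside on the manual rebalance referenced above: the FAQ procedure boils down to stopping the datanode and moving block subdirectories between volume directories. The sketch below is illustrative only — the paths are assumptions based on the dfs.data.dir in this thread, and the FAQ instructions should be followed exactly:

```shell
# Illustrative sketch of an intra-datanode rebalance (do NOT run as-is).
# The datanode must be stopped before moving block files between volumes.
SRC=/data1/hadoopfs/current   # assumed layout; check your dfs.data.dir
DST=/data4/hadoopfs/current

# With the datanode stopped, block subdirectories can be moved, e.g.:
#   mv "$SRC/subdir10" "$DST/"
echo "move block subdirs from $SRC to $DST while the datanode is down"
```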
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> I have a few points to make; these may not be very helpful for the said
> problem.
>
> + The "All datanodes are bad" exception does not really point to a
> problem related to disk space being full.
> + hadoop.tmp.dir acts as the base location for other Hadoop-related
> properties; not sure if any particular directory is created specifically.
> + Only one disk getting filled looks strange. Were the other disks part
> of the configuration when the NN was formatted?
>
> Would be interesting to know the reason for this.
> Please keep us posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> From the snapshot, you have around 3 TB left for writing data.
>
> Can you check each individual datanode's storage health?
> As you said, you have 80 servers writing to HDFS in parallel; I am not
> sure whether that could be an issue.
> As suggested in past threads, you can do a rebalance of the blocks, but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
> No, it's not a map-reduce job. We have a Java app running on around 80
> machines which writes to HDFS. The error I mentioned is being thrown by
> the application, and yes, we have the replication factor set to 3. The
> status of HDFS is as follows:
>
> Configured Capacity: 16.15 TB | DFS Used: 11.84 TB | Non DFS Used: 872.66 GB
> DFS Remaining: 3.46 TB | DFS Used%: 73.3% | DFS Remaining%: 21.42%
> Live Nodes: 10 | Dead Nodes: 0 | Decommissioning Nodes: 0
> Number of Under-Replicated Blocks: 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> When you say the application errors out, does that mean your MapReduce
> job is erroring? In that case, apart from HDFS space, you will need to
> look at the mapred tmp directory space as well.
>
> You have 400 GB * 4 * 10 = 16 TB of disk; assuming a replication factor
> of 3, at most you will have a data size of about 5 TB.
> I am also assuming you are not scheduling your program to run on the
> entire 5 TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled while the
> job is running.
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
> We are running a Hadoop cluster with 10 datanodes and a namenode. Each
> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with
> each disk having a capacity of 414 GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing an issue wherein /data1 fills up quickly; often we see
> its usage running at 100% with just a few megabytes of free space. This
> issue is visible on 7 out of 10 datanodes at present.
>
> We have some Java applications writing to HDFS, and we often see the
> following errors in our application logs:
>
>
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
>
> I went through some old discussions, and it looks like manual rebalancing
> is what is required in this case; we should also have
> dfs.datanode.du.reserved set up.
>
> However, I'd like to understand whether this issue, with one disk getting
> filled up to 100%, can cause the errors we are seeing in our
> application.
>
> Also, are there any other performance implications of some of the disks
> running at 100% usage on a datanode?
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
>
>
>
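
For reference, the reservation Mayank describes above is an hdfs-site.xml property. A minimal sketch, with the 30 GB value from this thread expressed in bytes (adjust the amount to your own headroom needs):

```xml
<!-- hdfs-site.xml: reserve space per volume for non-DFS use.
     30 GB = 30 * 1024^3 = 32212254720 bytes. -->
<property>
        <name>dfs.datanode.du.reserved</name>
        <value>32212254720</value>
        <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```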


RE: Application errors with one disk on datanode getting filled up to 100%

Posted by Sandeep L <sa...@outlook.com>.
Rahul,
In general, Hadoop tries to compute data locally: if you run a MapReduce task on a particular input, Hadoop will usually compute and write the output locally (this happens the majority of the time), then replicate it to other nodes.
In your scenario, the majority of your input data may live on a single datanode, so Hadoop keeps writing output data to that same datanode.
Thanks,
Sandeep.
From: rahul.rec.dgp@gmail.com
Date: Fri, 14 Jun 2013 17:50:46 +0530
Subject: Re: Application errors with one disk on datanode getting filled up to 100%
To: user@hadoop.apache.org

Thanks Sandeep,

I was thinking that the overall HDFS cluster might get unbalanced over time and the balancer might be useful in that case.

I was more interested to know why only one disk out of the four configured disks on the DN is getting all the writes. From what I have read, writes should happen in round-robin fashion, which should ideally load all the configured disks on the DN similarly.

Not sure how the balancer fixes this issue.

Rgds,
Rahul

On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com> wrote:





Rahul,
In general this issue sometimes happens in Hadoop; there is no single exact reason for it. To mitigate it you need to run the balancer at regular intervals.

Thanks,
Sandeep.
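
For reference, the balancer mentioned here redistributes blocks across datanodes, not across the disks within a single datanode. On Hadoop 1.x it can be invoked from any cluster node roughly as follows (the threshold value is illustrative; it does not fix per-disk skew inside one node):

```
# Run the HDFS balancer; -threshold is the allowed deviation (in percent)
# of each datanode's utilization from the cluster-wide average.
hadoop balancer -threshold 10
```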
Date: Fri, 14 Jun 2013 16:39:02 +0530
Subject: Re: Application errors with one disk on datanode getting filled up to 100%
From: mail2mayank@gmail.com


To: user@hadoop.apache.org

No, as of this moment we have no idea about the reasons for that behavior.


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:

Thanks Mayank. Any clue on why only one disk was getting all the writes?

Rahul


On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:





So we did a manual rebalance (followed instructions at: http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F) and also reserved 30 GB of space for non-DFS usage via dfs.datanode.du.reserved, then restarted our apps.

Things have been going fine till now.

Keeping fingers crossed :)
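
The manual per-disk rebalance from that FAQ entry boils down to moving block subdirectories between data directories while the datanode is stopped. A sketch, assuming the dfs.data.dir layout from this thread; the specific subdirectory name is illustrative, so verify against your own layout before moving anything:

```
# Stop the datanode first -- never move block files under a live datanode.
bin/hadoop-daemon.sh stop datanode

# Move whole subdir trees (block files plus their .meta files) from the
# full disk to an emptier one; the datanode rescans its data dirs on start.
mv /data1/hadoopfs/current/subdir10 /data3/hadoopfs/current/

bin/hadoop-daemon.sh start datanode
```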


On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:






I have a few points to make; these may not be very helpful for the said problem.

+ The "All datanodes are bad" exception does not really point to a problem related to disk space being full.
+ hadoop.tmp.dir acts as the base location for other Hadoop-related properties; not sure if any particular directory is created specifically.
+ Only one disk getting filled looks strange. The other disks were made part of the configuration when the NN was formatted.

Would be interesting to know the reason for this.
Please keep us posted.



Thanks,
Rahul


On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com> wrote:








From the snapshot, you have around 3 TB left for writing data.
Can you check each individual datanode's storage health?

As you said, you have 80 servers writing in parallel to HDFS; I am not sure whether that could be an issue.
As suggested in past threads, you can rebalance the blocks, but that will take some time to finish and will not solve your issue right away.
You can wait for others to reply. I am sure there will be far better solutions from experts for this.
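
Checking a datanode's storage health can start with comparing per-mount utilization on each node. A small sketch; the /dataN paths from this thread are what you would actually pass on a datanode:

```python
import shutil


def disk_usage_pct(path):
    """Return used space as a percentage of the mount's total capacity."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # On a datanode these would be /data1 .. /data4; "/" works anywhere.
    for mount in ["/"]:
        print("%-10s %5.1f%% used" % (mount, disk_usage_pct(mount)))
```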

On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:







No, it's not a map-reduce job. We've a Java app running on around 80 machines which writes to HDFS. The error that I'd mentioned is being thrown by the application, and yes, we've the replication factor set to 3. Following is the status of HDFS:

Configured Capacity   : 16.15 TB
DFS Used              : 11.84 TB
Non DFS Used          : 872.66 GB
DFS Remaining         : 3.46 TB
DFS Used%             : 73.3 %
DFS Remaining%        : 21.42 %
Live Nodes            : 10
Dead Nodes            : 0
Decommissioning Nodes : 0
Number of Under-Replicated Blocks : 0


On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com> wrote:










When you say the application errors out, does that mean your MapReduce job is erroring? In that case, apart from HDFS space, you will need to look at the mapred tmp directory space as well.

You've got 400 GB * 4 * 10 = 16 TB of disk; let's assume you have a replication factor of 3, so at most you can hold about 5 TB of data. I am also assuming you are not scheduling your program to run on the entire 5 TB with just 10 nodes.

I suspect your cluster's mapred tmp space is getting filled while the job is running.

On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:











We are running a Hadoop cluster with 10 datanodes and a namenode. Each datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each disk having a capacity of 414 GB.


hdfs-site.xml has following property set:

<property>
        <name>dfs.data.dir</name>
        <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
        <description>Data dirs for DFS.</description>
</property>
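
The dfs.datanode.du.reserved setting discussed later in the thread sits alongside dfs.data.dir in hdfs-site.xml. A sketch reserving 30 GB per volume for non-DFS usage (the value is the one eventually chosen in this thread; tune it for your disks):

```
<property>
        <name>dfs.datanode.du.reserved</name>
        <!-- 30 GB, in bytes, reserved per volume for non-DFS use -->
        <value>32212254720</value>
        <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```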

Now we are facing an issue wherein we find /data1 getting filled up quickly, and many a time we see its usage running at 100% with just a few megabytes of free space. This issue is visible on 7 out of 10 datanodes at present.
We've some Java applications which write to HDFS, and many a time we see the following errors in our application logs:



java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)


I went through some old discussions, and it looks like manual rebalancing is what is required in this case; we should also have dfs.datanode.du.reserved set up.

However, I'd like to understand whether this issue, with one disk getting filled up to 100%, can result in the errors we are seeing in our application.

Also, are there any other performance implications due to some of the disks running at 100% usage on a datanode?

-- 
Mayank Joshi

Skype: mail2mayank 
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...



-- 
Nitin Pawar








Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandeep,

I was thinking that the overall HDFS cluster might get unbalanced over
time and the balancer might be useful in that case.
I was more interested to know why only one disk out of the four configured
disks on the DN is getting all the writes. From what I have read, writes
should happen in round-robin fashion, which should ideally load all the
configured disks on the DN similarly.

Not sure how the balancer is fixing this issue.

Rgds,
Rahul
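
On the round-robin question: stock Hadoop of this era picks volumes round-robin, which can still drift when block sizes vary or a disk also hosts non-DFS data. Newer releases (2.1+, via HDFS-1804) add a pluggable policy that prefers volumes with more free space; a sketch of the hdfs-site.xml setting, assuming your version ships that policy:

```
<property>
        <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
        <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
```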



On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general this issue sometimes happens in Hadoop. There is no single
> exact reason for it.
> To mitigate it you need to run the balancer at regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> From: mail2mayank@gmail.com
> To: user@hadoop.apache.org
>
>
> No, as of this moment we've no ideas about the reasons for that behavior.
>
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> Thanks Mayank. Any clue on why only one disk was getting all the writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
> So we did a manual rebalance (followed instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non-DFS usage via
> dfs.datanode.du.reserved, then restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> I have a few points to make , these may not be very helpful for the said
> problem.
>
> + The "All datanodes are bad" exception does not really point to a problem
> related to disk space being full.
> + hadoop.tmp.dir acts as the base location for other Hadoop-related
> properties; not sure if any particular directory is created specifically.
> + Only one disk getting filled looks strange. The other disks were made
> part of the configuration when the NN was formatted.
>
> Would be interesting to know the reason for this.
> Please keep posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> From the snapshot, you got around 3TB for writing data.
>
> Can you check individual datanode's storage health.
> As you said, you have 80 servers writing in parallel to HDFS; I am not sure
> whether that could be an issue.
> As suggested in past threads, you can do a rebalance of the blocks but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
> No, it's not a map-reduce job. We've a Java app running on around 80
> machines which writes to HDFS. The error that I'd mentioned is being thrown
> by the application, and yes, we've the replication factor set to 3. Following
> is the status of HDFS:
>
> Configured Capacity : 16.15 TB
> DFS Used : 11.84 TB
> Non DFS Used : 872.66 GB
> DFS Remaining : 3.46 TB
> DFS Used% : 73.3 %
> DFS Remaining% : 21.42 %
> Live Nodes : 10
> Dead Nodes : 0
> Decommissioning Nodes : 0
> Number of Under-Replicated Blocks : 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> When you say the application errors out, does that mean your mapreduce job
> is erroring? In that case, apart from HDFS space, you will need to look at
> the mapred tmp directory space as well.
>
> You've got 400GB * 4 * 10 = 16TB of disk, and let's assume that you have a
> replication factor of 3, so at most you will have a data size of about 5TB.
> I am also assuming you are not scheduling your program to run on the entire
> 5TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled up while the
> job is running.
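The back-of-the-envelope numbers above can be checked quickly (using decimal units and the rounded 400 GB per disk from the thread; the actual disks are 414 GB):

```python
# Rough cluster-capacity math from the thread, in decimal units.
disk_gb = 400          # rounded figure from the thread; actual disks are 414 GB
disks_per_node = 4
nodes = 10
replication = 3

raw_tb = disk_gb * disks_per_node * nodes / 1000.0
usable_tb = raw_tb / replication   # max unique data at 3x replication

print(f"raw: {raw_tb} TB, usable at 3x replication: {usable_tb:.1f} TB")
# → raw: 16.0 TB, usable at 3x replication: 5.3 TB
```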
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
> We are running a hadoop cluster with 10 datanodes and a namenode. Each
> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
> disk having a capacity of 414GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing an issue wherein we find /data1 getting filled up
> quickly, and many times we see its usage running at 100% with just a few
> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
> present.
>
> We've some Java applications which are writing to HDFS, and many times we
> are seeing the following errors in our application logs:
>
>
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
>
> I went through some old discussions, and it looks like manual rebalancing is
> what is required in this case; we should also have
> dfs.datanode.du.reserved set up.
>
> However, I'd like to understand whether this issue, with one disk getting
> filled up to 100%, can result in the errors which we are seeing in our
> application.
>
> Also, are there any other performance implications due to some of the disks
> running at 100% usage on a datanode?
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Harsh J <ha...@cloudera.com>.
Sandeep/Mayank,

If you take a look at the volume-selection parts of the code, you can
see it is simply round robin. There's no way we would continuously select
the same disk, unless the disk is deselected for errors (tolerated) or
for space (due to lack of it, or reservation). It's better to monitor for a
pattern and look for a misconfiguration, rather than suspect a bug and yet
accept the behavior.

Rahul,

The current HDFS version received a better inter-disk balancing code that
I've seen in use already. See
https://issues.apache.org/jira/browse/HDFS-1804 for more info.
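A minimal sketch of the round-robin selection described above (not the actual DataNode code; the volume list, free-byte counts, and failure set are stand-ins for the real volume checks):

```python
# Round-robin volume choice that skips full or failed volumes,
# loosely mirroring the behavior described in this thread.
class RoundRobinVolumePicker:
    def __init__(self, volumes):
        self.volumes = volumes      # list of (name, free_bytes)
        self.next_idx = 0
        self.failed = set()         # volumes deselected for I/O errors

    def choose(self, block_size):
        # Try each volume at most once, starting where we left off.
        for _ in range(len(self.volumes)):
            name, free = self.volumes[self.next_idx]
            self.next_idx = (self.next_idx + 1) % len(self.volumes)
            if name not in self.failed and free >= block_size:
                return name
        raise IOError("no volume with enough space")

picker = RoundRobinVolumePicker([("/data1", 10), ("/data2", 100),
                                 ("/data3", 100), ("/data4", 100)])
# /data1 lacks space for a 64-unit block, so it is skipped each round:
print([picker.choose(64) for _ in range(4)])
# → ['/data2', '/data3', '/data4', '/data2']
```

The point of the sketch: a volume only stops receiving writes when it fails the space or error check, so a single disk filling up persistently suggests reservation/space misconfiguration rather than broken rotation.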


On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general this issue happens sometimes in Hadoop. There is no single
> known reason for it.
> To mitigate it you need to run the balancer at regular intervals.
>
> Thanks,
> Sandeep.



-- 
Harsh J

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandeep,

I was thinking that the overall HDFS cluster might get unbalanced over
time and the balancer might be useful in that case.
I was more interested to know why only one disk, out of the 4 configured
disks on the DN, is getting all the writes. As per whatever I have read, writes
should happen in round-robin fashion, which should ideally lead to all the
configured disks in the DN being similarly loaded.

Not sure how the balancer is fixing this issue.

Rgds,
Rahul




Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandeep,

I was thinking that the overall hdfs cluster might get unbalanced over the
time and balancer might be useful in that case.
I was more interested to know why only one disk out of configured 4 disks
of the DN is getting all the writes.As per whatever I have read , writes
should be in round robin fashion , which should ideally lead to all the
configured disks in the DN to be similarly loaded.

Not sure how the balancer is fixing this issue.

Rgds,
Rahul



On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general this issue happens some times in Hadoop. There is no exact
> reason for this.
> To mitigate this you need to run balancer in regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> From: mail2mayank@gmail.com
> To: user@hadoop.apache.org
>
>
> No, as of this moment we've no ideas about the reasons for that behavior.
>
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> Thanks Mayank, Any clue on why was only one disk was getting all writes.
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
> So we did a manual rebalance (followed instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non dfs usage via
> dfs.datanode.du.reserved and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> I have a few points to make , these may not be very helpful for the said
> problem.
>
> +All data nodes are bad exception is kind of not pointing to the problem
> related to disk space full.
> +hadoop.tmp.dir acts as base location of other hadoop related properties ,
> not sure if any particular directory is created specifically.
> +Only one disk getting filled looks strange.The other disk are part while
> formatting the NN.
>
> Would be interesting to know the reason for this.
> Please keep posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> From the snapshot, you got around 3TB for writing data.
>
> Can you check individual datanode's storage health.
> As you said you got 80 servers writing parallely to hdfs, I am not sure
> can that be an issue.
> As suggested in past threads, you can do a rebalance of the blocks but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
> No it's not a map-reduce job. We've a java app running on around 80
> machines which writes to hdfs. The error that I'd mentioned is being thrown
> by the application and yes we've replication factor set to 3 and following
> is status of hdfs:
>
> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used : 872.66
> GB DFS Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
> : 0 Number of Under-Replicated Blocks : 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> when you say application errors out .. does that mean your mapreduce job
> is erroring? In that case apart from hdfs space you will need to look at
> mapred tmp directory space as well.
>
> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
> replication factor of 3 so at max you will have datasize of 5TB with you.
> I am also assuming you are not scheduling your program to run on entire
> 5TB with just 10 nodes.
>
> i suspect your clusters mapred tmp space is getting filled in while the
> job is running.
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
> We are running a hadoop cluster with 10 datanodes and a namenode. Each
> datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which each
> disk having a capacity 414GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing a issue where in we find /data1 getting filled up
> quickly and many a times we see it's usage running at 100% with just few
> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
> present.
>
> We've some java applications which are writing to hdfs and many a times we
> are seeing foloowing errors in our application logs:
>
>
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
>
> I went through some old discussions and looks like manual rebalancing is
> what is required in this case and we should also have
> dfs.datanode.du.reserved set up.
>
> However I'd like to understand if this issue, with one disk getting filled
> up to 100% can result into the issue which we are seeing in our
> application.
>
> Also, are there any other peformance implications due to some of the disks
> running at 100% usage on a datanode.
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar
>
>
>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Harsh J <ha...@cloudera.com>.
Sandeep/Mayank,

If you take a look at the volume selection parts of the code, you can
notice it is simply round robin. There's no way we continuously may select
the same disk, unless the disk is deselected for errors (tolerated) or
space (due to lack or reservation). Its better to monitor for a pattern and
look for a misconfiguration, rather than suspect a bug and also accept the
behavior.

Rahul,

The current HDFS version received a better inter-disk balancing code that
I've seen in use already. See
https://issues.apache.org/jira/browse/HDFS-1804 for more info.


On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general this issue happens some times in Hadoop. There is no exact
> reason for this.
> To mitigate this you need to run balancer in regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> From: mail2mayank@gmail.com
> To: user@hadoop.apache.org
>
>
> No, as of this moment we've no ideas about the reasons for that behavior.
>
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> Thanks Mayank, Any clue on why was only one disk was getting all writes.
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
> So we did a manual rebalance (followed instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non dfs usage via
> dfs.datanode.du.reserved and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> I have a few points to make , these may not be very helpful for the said
> problem.
>
> +All data nodes are bad exception is kind of not pointing to the problem
> related to disk space full.
> +hadoop.tmp.dir acts as base location of other hadoop related properties ,
> not sure if any particular directory is created specifically.
> +Only one disk getting filled looks strange.The other disk are part while
> formatting the NN.
>
> Would be interesting to know the reason for this.
> Please keep posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> From the snapshot, you got around 3TB for writing data.
>
> Can you check individual datanode's storage health.
> As you said you got 80 servers writing parallely to hdfs, I am not sure
> can that be an issue.
> As suggested in past threads, you can do a rebalance of the blocks but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
> No it's not a map-reduce job. We've a java app running on around 80
> machines which writes to hdfs. The error that I'd mentioned is being thrown
> by the application and yes we've replication factor set to 3 and following
> is status of hdfs:
>
> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used : 872.66
> GB DFS Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
> : 0 Number of Under-Replicated Blocks : 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> when you say application errors out .. does that mean your mapreduce job
> is erroring? In that case apart from hdfs space you will need to look at
> mapred tmp directory space as well.
>
> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
> replication factor of 3 so at max you will have datasize of 5TB with you.
> I am also assuming you are not scheduling your program to run on entire
> 5TB with just 10 nodes.
>
> i suspect your clusters mapred tmp space is getting filled in while the
> job is running.
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
> We are running a hadoop cluster with 10 datanodes and a namenode. Each
> datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which each
> disk having a capacity 414GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing a issue where in we find /data1 getting filled up
> quickly and many a times we see it's usage running at 100% with just few
> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
> present.
>
> We've some java applications which are writing to hdfs and many a times we
> are seeing foloowing errors in our application logs:
>
>
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
>
> I went through some old discussions and looks like manual rebalancing is
> what is required in this case and we should also have
> dfs.datanode.du.reserved set up.
>
> However I'd like to understand if this issue, with one disk getting filled
> up to 100% can result into the issue which we are seeing in our
> application.
>
> Also, are there any other peformance implications due to some of the disks
> running at 100% usage on a datanode.
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar



-- 
Harsh J

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Sandeep,

I was thinking that the overall HDFS cluster might become unbalanced over
time, and that the balancer might be useful in that case.
I was more interested in why only one of the four configured disks on the
DN is getting all the writes. From whatever I have read, writes should be
done in round-robin fashion, which should ideally keep all the configured
disks on the DN similarly loaded.

I am not sure how the balancer fixes this issue.

Rgds,
Rahul



On Fri, Jun 14, 2013 at 4:45 PM, Sandeep L <sa...@outlook.com>wrote:

> Rahul,
>
> In general this issue happens sometimes in Hadoop; there is no single exact
> reason for it.
> To mitigate it, you need to run the balancer at regular intervals.
>
> Thanks,
> Sandeep.
>
> ------------------------------
> Date: Fri, 14 Jun 2013 16:39:02 +0530
> Subject: Re: Application errors with one disk on datanode getting filled
> up to 100%
> From: mail2mayank@gmail.com
> To: user@hadoop.apache.org
>
>
> No, as of this moment we have no idea about the reasons for that behavior.
>
>
> On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> Thanks Mayank. Any clue on why only one disk was getting all the writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
> So we did a manual rebalance (followed instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non-DFS usage via
> dfs.datanode.du.reserved and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
> I have a few points to make; these may not be very helpful for the said
> problem.
>
> +The "All datanodes are bad" exception does not really point to a problem
> related to disk space being full.
> +hadoop.tmp.dir acts as the base location for other Hadoop-related
> properties; I'm not sure if any particular directory is created specifically.
> +Only one disk getting filled looks strange. The other disks were part of
> the configuration when the NN was formatted.
>
> Would be interesting to know the reason for this.
> Please keep posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> From the snapshot, you have around 3 TB left for writing data.
>
> Can you check each individual datanode's storage health?
> As you said, you have 80 servers writing in parallel to HDFS; I am not sure
> whether that could be an issue.
> As suggested in past threads, you can do a rebalance of the blocks, but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
> No, it's not a map-reduce job. We've a Java app running on around 80
> machines which writes to HDFS. The error that I'd mentioned is being thrown
> by the application, and yes, we've the replication factor set to 3.
> Following is the status of HDFS:
>
> Configured Capacity: 16.15 TB | DFS Used: 11.84 TB | Non DFS Used: 872.66 GB
> DFS Remaining: 3.46 TB | DFS Used%: 73.3% | DFS Remaining%: 21.42%
> Live Nodes: 10 | Dead Nodes: 0 | Decommissioning Nodes: 0
> Number of Under-Replicated Blocks: 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
> When you say the application errors out, does that mean your mapreduce job
> is erroring? In that case, apart from HDFS space, you will need to look at
> the mapred tmp directory space as well.
>
> You got 400 GB * 4 * 10 = 16 TB of disk; let's assume you have a
> replication factor of 3, so at most you will have a data size of about 5 TB.
> I am also assuming you are not scheduling your program to run on the entire
> 5 TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled while the
> job is running.
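Nitin's back-of-the-envelope capacity math above can be checked in a few lines (a sketch; the 414 GB per-disk figure and replication factor 3 are the numbers reported in this thread):

```java
public class CapacityEstimate {
    public static void main(String[] args) {
        double perDiskGb = 414.0;   // capacity of each data disk, from the original post
        int disksPerNode = 4;
        int nodes = 10;
        int replication = 3;

        // Raw DFS capacity across the cluster, in TB (using 1 TB = 1024 GB).
        double rawTb = perDiskGb * disksPerNode * nodes / 1024.0;

        // With 3x replication, unique data is roughly a third of the raw capacity.
        double usableTb = rawTb / replication;

        System.out.printf("raw: %.2f TB, usable with %dx replication: %.2f TB%n",
                rawTb, replication, usableTb);
    }
}
```

This gives roughly 16.17 TB raw, close to the 16.15 TB configured capacity in the status dump above, and about 5.39 TB of unique data at 3x replication.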
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
> We are running a hadoop cluster with 10 datanodes and a namenode. Each
> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
> disk having a capacity of 414 GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing an issue wherein we find /data1 getting filled up
> quickly, and many times we see its usage running at 100% with just a few
> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
> present.
>
> We've some Java applications which write to HDFS, and many times we are
> seeing the following errors in our application logs:
>
>
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
>
> I went through some old discussions, and it looks like manual rebalancing
> is what is required in this case; we should also have
> dfs.datanode.du.reserved set up.
>
> However, I'd like to understand whether this issue, with one disk getting
> filled up to 100%, can result in the errors which we are seeing in our
> application.
>
> Also, are there any other performance implications due to some of the disks
> running at 100% usage on a datanode?
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>
>
>
>
> --
> Nitin Pawar
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Harsh J <ha...@cloudera.com>.
Sandeep/Mayank,

If you take a look at the volume-selection parts of the code, you can see
that it is simply round robin. The same disk should not be selected
continuously, unless the other disks are deselected for errors (beyond the
tolerated count) or for space (due to lack of space or reservation). It's
better to monitor for a pattern and look for a misconfiguration, rather
than suspect a bug and simply accept the behavior.

Rahul,

The current HDFS version received better inter-disk balancing code that
I've already seen in use. See
https://issues.apache.org/jira/browse/HDFS-1804 for more info.
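The round-robin policy Harsh describes can be sketched roughly as follows. This is a simplified illustration, not the actual DataNode code; the class name and its details are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of round-robin volume selection: volumes are tried in turn, and a
// volume is skipped only when it cannot hold the block (e.g. it is full, or
// writing the block would eat into its configured reservation).
class RoundRobinVolumePicker {
    private final List<Long> freeBytesPerVolume;
    private final long reservedBytes;
    private int next = 0;

    RoundRobinVolumePicker(List<Long> freeBytesPerVolume, long reservedBytes) {
        this.freeBytesPerVolume = new ArrayList<>(freeBytesPerVolume);
        this.reservedBytes = reservedBytes;
    }

    /** Returns the index of the volume chosen for a block, or -1 if none fits. */
    int pick(long blockSize) {
        for (int tried = 0; tried < freeBytesPerVolume.size(); tried++) {
            int candidate = next;
            next = (next + 1) % freeBytesPerVolume.size();
            if (freeBytesPerVolume.get(candidate) - reservedBytes >= blockSize) {
                freeBytesPerVolume.set(candidate,
                        freeBytesPerVolume.get(candidate) - blockSize);
                return candidate;
            }
        }
        return -1; // every volume is full or inside its reservation
    }

    public static void main(String[] args) {
        // Four volumes; volume 0 is nearly full, like /data1 in this thread.
        RoundRobinVolumePicker picker = new RoundRobinVolumePicker(
                List.of(10L, 1000L, 1000L, 1000L), 5L);
        for (int i = 0; i < 6; i++) {
            System.out.println("block " + i + " -> volume " + picker.pick(100L));
        }
    }
}
```

Under this policy the nearly-full volume 0 is simply skipped and blocks rotate over volumes 1–3, which is why one disk persistently *receiving* all writes (rather than being skipped) points at a misconfiguration rather than the selection logic.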





-- 
Harsh J


RE: Application errors with one disk on datanode getting filled up to 100%

Posted by Sandeep L <sa...@outlook.com>.
Rahul,

In general this issue happens sometimes in Hadoop; there is no single exact
reason for it. To mitigate it, you need to run the balancer at regular
intervals.

Thanks,
Sandeep.
Date: Fri, 14 Jun 2013 16:39:02 +0530
Subject: Re: Application errors with one disk on datanode getting filled up to 100%
From: mail2mayank@gmail.com
To: user@hadoop.apache.org

No, as of this moment we have no idea about the reasons for that behavior.



RE: Application errors with one disk on datanode getting filled up to 100%

Posted by Sandeep L <sa...@outlook.com>.
Rahul,
In general this issue happens some times in Hadoop. There is no exact reason for this.To mitigate this you need to run balancer in regular intervals.
Thanks,Sandeep.
Date: Fri, 14 Jun 2013 16:39:02 +0530
Subject: Re: Application errors with one disk on datanode getting filled up to 100%
From: mail2mayank@gmail.com
To: user@hadoop.apache.org

No, as of this moment we've no ideas about the reasons for that behavior. 


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:

Thanks Mayank, Any clue on why was only one disk was getting all writes.




Rahul


On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:



So we did a manual rebalance (followed instructions at: http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F) and also reserved 30 GB of space for non dfs usage via dfs.datanode.du.reserved and restarted our apps. 





Things have been going fine till now. 

Keeping fingers crossed :)


On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:




I have a few points to make , these may not be very helpful for the said problem.







+All data nodes are bad exception is kind of not pointing to the problem related to disk space full.
+hadoop.tmp.dir acts as base location of other hadoop related properties , not sure if any particular directory is created specifically.






+Only one disk getting filled looks strange.The other disk are part while formatting the NN.







Would be interesting to know the reason for this.
Please keep posted.



Thanks,
Rahul


On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com> wrote:






>From the snapshot, you got around 3TB for writing data. 
Can you check individual datanode's storage health. 





As you said you got 80 servers writing parallely to hdfs, I am not sure can that be an issue. 
As suggested in past threads, you can do a rebalance of the blocks but that will take some time to finish and will not solve your issue right away. 
You can wait for others to reply. I am sure there will be far better solutions from experts for this. 








On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:







No it's not a map-reduce job. We've a java app running on around 80 machines which writes to hdfs. The error that I'd mentioned is being thrown by the application and yes we've replication factor set to 3 and following is status of hdfs:









 Configured Capacity : 16.15 TB  DFS Used : 11.84 TB
  Non DFS Used : 872.66 GB  DFS Remaining : 3.46 TB  DFS Used%
 : 73.3 %  DFS Remaining% : 21.42 %  Live Nodes 







 : 10  Dead Nodes  : 0
  Decommissioning Nodes  : 0 
 Number of Under-Replicated Blocks : 0


On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com> wrote:








when you say application errors out .. does that mean your mapreduce job is erroring? In that case apart from hdfs space you will need to look at mapred tmp directory space as well. 









you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a replication factor of 3 so at max you will have datasize of 5TB with you. I am also assuming you are not scheduling your program to run on entire 5TB with just 10 nodes. 









i suspect your clusters mapred tmp space is getting filled in while the job is running. 





On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:









We are running a hadoop cluster with 10 datanodes and a namenode. Each datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which each disk having a capacity 414GB.


hdfs-site.xml has following property set:











<property>
        <name>dfs.data.dir</name>
        <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
        <description>Data dirs for DFS.</description>










</property>

Now we are facing an issue wherein we find /data1 getting filled up quickly, and many times we see its usage running at 100% with just a few megabytes of free space. This issue is visible on 7 out of 10 datanodes at present.

We've some Java applications which are writing to HDFS, and many times we are seeing the following errors in our application logs:

java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)


I went through some old discussions, and it looks like manual rebalancing is what is required in this case; we should also have dfs.datanode.du.reserved set up.
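For reference, reserving space for non-DFS use is another hdfs-site.xml property, set alongside dfs.data.dir. A minimal sketch — the 30 GB figure mirrors what was eventually used later in this thread, and note the value is bytes per volume:

<property>
        <name>dfs.datanode.du.reserved</name>
        <value>32212254720</value>
        <description>Reserved space in bytes per volume for non-DFS use (30 GB).</description>
</property>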

However, I'd like to understand whether this issue, with one disk getting filled up to 100%, can result in the errors which we are seeing in our application.

Also, are there any other performance implications due to some of the disks running at 100% usage on a datanode?

-- 
Mayank Joshi

Skype: mail2mayank 
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...



-- 
Nitin Pawar

RE: Application errors with one disk on datanode getting filled up to 100%

Posted by Sandeep L <sa...@outlook.com>.
Rahul,

In general this issue happens sometimes in Hadoop; there is no single exact reason for it. To mitigate it, you need to run the balancer at regular intervals.

Thanks,
Sandeep.
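As an aside, "run the balancer at regular intervals" can be wired up with cron; a sketch under assumptions — the hadoop binary path, log path, and the 10% utilization threshold are illustrative, not taken from this thread:

# Run the HDFS balancer every Sunday at 02:00 with a 10% utilization threshold
0 2 * * 0  /usr/lib/hadoop/bin/hadoop balancer -threshold 10 >> /var/log/hadoop-balancer.log 2>&1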
Date: Fri, 14 Jun 2013 16:39:02 +0530
Subject: Re: Application errors with one disk on datanode getting filled up to 100%
From: mail2mayank@gmail.com
To: user@hadoop.apache.org

No, as of this moment we've no idea about the reasons for that behavior.

On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:

Thanks Mayank. Any clue on why only one disk was getting all the writes?

Rahul

On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:

So we did a manual rebalance (following the instructions at http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F) and also reserved 30 GB of space for non-DFS usage via dfs.datanode.du.reserved, and then restarted our apps.

Things have been going fine till now.

Keeping fingers crossed :)
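To keep an eye on whether individual data disks drift toward 100% again, per-mount usage can be checked with a short script; a sketch — the /dataN mount points are taken from the original post and are assumptions for any other cluster:

```python
import shutil

def usage_percent(path):
    """Return the percentage of the filesystem at `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total

if __name__ == "__main__":
    # Mount points from the original post; adjust for your datanodes.
    for mount in ("/data1", "/data2", "/data3", "/data4"):
        try:
            print(f"{mount}: {usage_percent(mount):.1f}% used")
        except FileNotFoundError:
            print(f"{mount}: not mounted on this host")
```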


On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <ra...@gmail.com> wrote:

I have a few points to make; these may not be very helpful for the said problem.

+ The "All datanodes are bad" exception does not really point to a disk-space-full problem.
+ hadoop.tmp.dir acts as the base location for other Hadoop-related properties; I'm not sure if any particular directory is created specifically.
+ Only one disk getting filled looks strange; the other disks are made part of HDFS when the NN is formatted.

Would be interesting to know the reason for this.
Please keep us posted.

Thanks,
Rahul


On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com> wrote:

From the snapshot, you've got around 3 TB left for writing data.
Can you check the individual datanodes' storage health?
As you said, you've got 80 servers writing to HDFS in parallel; I am not sure whether that can be an issue.
As suggested in past threads, you can do a rebalance of the blocks, but that will take some time to finish and will not solve your issue right away.
You can wait for others to reply. I am sure there will be far better solutions from the experts for this.









Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, as of this moment we've no idea about the reasons for that behavior.


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Mayank. Any clue on why only one disk was getting all the writes?
>
> Rahul


-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, as of this moment we've no ideas about the reasons for that behavior.


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Mayank, Any clue on why was only one disk was getting all writes.
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
>> So we did a manual rebalance (followed instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
>> and also reserved 30 GB of space for non dfs usage via
>> dfs.datanode.du.reserved and restarted our apps.
>>
>> Things have been going fine till now.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> I have a few points to make , these may not be very helpful for the said
>>> problem.
>>>
>>> +All data nodes are bad exception is kind of not pointing to the problem
>>> related to disk space full.
>>> +hadoop.tmp.dir acts as base location of other hadoop related properties
>>> , not sure if any particular directory is created specifically.
>>> +Only one disk getting filled looks strange.The other disk are part
>>> while formatting the NN.
>>>
>>> Would be interesting to know the reason for this.
>>> Please keep posted.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> From the snapshot, you got around 3TB for writing data.
>>>>
>>>> Can you check individual datanode's storage health.
>>>> As you said you got 80 servers writing parallely to hdfs, I am not sure
>>>> can that be an issue.
>>>> As suggested in past threads, you can do a rebalance of the blocks but
>>>> that will take some time to finish and will not solve your issue right
>>>> away.
>>>>
>>>> You can wait for others to reply. I am sure there will be far better
>>>> solutions from experts for this.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>>>
>>>>> No it's not a map-reduce job. We've a java app running on around 80
>>>>> machines which writes to hdfs. The error that I'd mentioned is being thrown
>>>>> by the application and yes we've replication factor set to 3 and following
>>>>> is status of hdfs:
>>>>>
>>>>> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used :872.66 GB DFS
>>>>> Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
>>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
>>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
>>>>> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
>>>>> : 0 Number of Under-Replicated Blocks : 0
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> when you say application errors out .. does that mean your mapreduce
>>>>>> job is erroring? In that case apart from hdfs space you will need to look
>>>>>> at mapred tmp directory space as well.
>>>>>>
>>>>>> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
>>>>>> replication factor of 3 so at max you will have datasize of 5TB with you.
>>>>>> I am also assuming you are not scheduling your program to run on
>>>>>> entire 5TB with just 10 nodes.
>>>>>>
>>>>>> i suspect your clusters mapred tmp space is getting filled in while
>>>>>> the job is running.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com>wrote:
>>>>>>
>>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>>> Each datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which
>>>>>>> each disk having a capacity 414GB.
>>>>>>>
>>>>>>>
>>>>>>> hdfs-site.xml has following property set:
>>>>>>>
>>>>>>> <property>
>>>>>>>         <name>dfs.data.dir</name>
>>>>>>>
>>>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>>         <description>Data dirs for DFS.</description>
>>>>>>> </property>
>>>>>>>
>>>>>>> Now we are facing a issue where in we find /data1 getting filled up
>>>>>>> quickly and many a times we see it's usage running at 100% with just few
>>>>>>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>>>>>>> present.
>>>>>>>
>>>>>>> We've some java applications which are writing to hdfs and many a
>>>>>>> times we are seeing foloowing errors in our application logs:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I went through some old discussions and looks like manual
>>>>>>> rebalancing is what is required in this case and we should also have
>>>>>>> dfs.datanode.du.reserved set up.
>>>>>>>
>>>>>>> However I'd like to understand if this issue, with one disk getting
>>>>>>> filled up to 100% can result into the issue which we are seeing in our
>>>>>>> application.
>>>>>>>
>>>>>>> Also, are there any other peformance implications due to some of the
>>>>>>> disks running at 100% usage on a datanode.
>>>>>>> --
>>>>>>> Mayank Joshi
>>>>>>>
>>>>>>> Skype: mail2mayank
>>>>>>> Mb.:  +91 8690625808
>>>>>>>
>>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>>
>>>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mayank Joshi
>>>>>
>>>>> Skype: mail2mayank
>>>>> Mb.:  +91 8690625808
>>>>>
>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>
>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>


-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, as of this moment we've no ideas about the reasons for that behavior.


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Mayank, Any clue on why was only one disk was getting all writes.
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
>> So we did a manual rebalance (followed instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
>> and also reserved 30 GB of space for non dfs usage via
>> dfs.datanode.du.reserved and restarted our apps.
>>
>> Things have been going fine till now.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> I have a few points to make , these may not be very helpful for the said
>>> problem.
>>>
>>> +All data nodes are bad exception is kind of not pointing to the problem
>>> related to disk space full.
>>> +hadoop.tmp.dir acts as base location of other hadoop related properties
>>> , not sure if any particular directory is created specifically.
>>> +Only one disk getting filled looks strange.The other disk are part
>>> while formatting the NN.
>>>
>>> Would be interesting to know the reason for this.
>>> Please keep posted.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> From the snapshot, you got around 3TB for writing data.
>>>>
>>>> Can you check individual datanode's storage health.
>>>> As you said you got 80 servers writing parallely to hdfs, I am not sure
>>>> can that be an issue.
>>>> As suggested in past threads, you can do a rebalance of the blocks but
>>>> that will take some time to finish and will not solve your issue right
>>>> away.
>>>>
>>>> You can wait for others to reply. I am sure there will be far better
>>>> solutions from experts for this.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>>>
>>>>> No it's not a map-reduce job. We've a java app running on around 80
>>>>> machines which writes to hdfs. The error that I'd mentioned is being thrown
>>>>> by the application and yes we've replication factor set to 3 and following
>>>>> is status of hdfs:
>>>>>
>>>>> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used :872.66 GB DFS
>>>>> Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
>>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
>>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
>>>>> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
>>>>> : 0 Number of Under-Replicated Blocks : 0
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> when you say application errors out .. does that mean your mapreduce
>>>>>> job is erroring? In that case apart from hdfs space you will need to look
>>>>>> at mapred tmp directory space as well.
>>>>>>
>>>>>> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
>>>>>> replication factor of 3 so at max you will have datasize of 5TB with you.
>>>>>> I am also assuming you are not scheduling your program to run on
>>>>>> entire 5TB with just 10 nodes.
>>>>>>
>>>>>> i suspect your clusters mapred tmp space is getting filled in while
>>>>>> the job is running.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com>wrote:
>>>>>>
>>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>>> Each datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which
>>>>>>> each disk having a capacity 414GB.
>>>>>>>
>>>>>>>
>>>>>>> hdfs-site.xml has following property set:
>>>>>>>
>>>>>>> <property>
>>>>>>>         <name>dfs.data.dir</name>
>>>>>>>
>>>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>>         <description>Data dirs for DFS.</description>
>>>>>>> </property>
>>>>>>>
>>>>>>> Now we are facing a issue where in we find /data1 getting filled up
>>>>>>> quickly and many a times we see it's usage running at 100% with just few
>>>>>>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>>>>>>> present.
>>>>>>>
>>>>>>> We've some java applications which are writing to hdfs and many a
>>>>>>> times we are seeing foloowing errors in our application logs:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I went through some old discussions, and it looks like manual
>>>>>>> rebalancing is what is required in this case; we should also have
>>>>>>> dfs.datanode.du.reserved set up.
>>>>>>>
>>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>>> getting filled up to 100%, can result in the errors which we are
>>>>>>> seeing in our application.
>>>>>>>
>>>>>>> Also, are there any other performance implications due to some of the
>>>>>>> disks running at 100% usage on a datanode?
>>>>>>> --
>>>>>>> Mayank Joshi
>>>>>>>
>>>>>>> Skype: mail2mayank
>>>>>>> Mb.:  +91 8690625808
>>>>>>>
>>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>>
>>>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mayank Joshi
>>>>>
>>>>> Skype: mail2mayank
>>>>> Mb.:  +91 8690625808
>>>>>
>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>
>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>


-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, as of this moment we have no idea about the reason for that behavior.


On Fri, Jun 14, 2013 at 4:04 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> Thanks Mayank. Any clue on why only one disk was getting all the writes?
>
> Rahul
>
>
> On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:
>
>> So we did a manual rebalance (following the instructions at:
>> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F),
>> reserved 30 GB of space for non-dfs usage via dfs.datanode.du.reserved,
>> and restarted our apps.
>>
>> Things have been going fine till now.
>>
>> Keeping fingers crossed :)
>>
>>
>> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
>> rahul.rec.dgp@gmail.com> wrote:
>>
>>> I have a few points to make; these may not be very helpful for the said
>>> problem.
>>>
>>> + The "All datanodes are bad" exception does not really point to a
>>> disk-space-full problem.
>>> + hadoop.tmp.dir acts as the base location for other hadoop-related
>>> properties; not sure if any particular directory is created specifically.
>>> + Only one disk getting filled looks strange. The other disks were part
>>> of the setup when the NN was formatted.
>>>
>>> Would be interesting to know the reason for this.
>>> Please keep us posted.
>>>
>>> Thanks,
>>> Rahul
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> From the snapshot, you've got around 3TB left for writing data.
>>>>
>>>> Can you check each individual datanode's storage health?
>>>> As you said, you've got 80 servers writing to hdfs in parallel; I am not
>>>> sure whether that could be an issue.
>>>> As suggested in past threads, you can rebalance the blocks, but that
>>>> will take some time to finish and will not solve your issue right away.
>>>>
>>>> You can wait for others to reply. I am sure there will be far better
>>>> solutions from experts for this.
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>>>
>>>>> No, it's not a map-reduce job. We have a java app running on around 80
>>>>> machines which writes to hdfs. The error that I'd mentioned is being
>>>>> thrown by the application, and yes, we have the replication factor set
>>>>> to 3. The following is the status of hdfs:
>>>>>
>>>>> Configured Capacity : 16.15 TB
>>>>> DFS Used : 11.84 TB
>>>>> Non DFS Used : 872.66 GB
>>>>> DFS Remaining : 3.46 TB
>>>>> DFS Used% : 73.3 %
>>>>> DFS Remaining% : 21.42 %
>>>>> Live Nodes : 10
>>>>> Dead Nodes : 0
>>>>> Decommissioning Nodes : 0
>>>>> Number of Under-Replicated Blocks : 0
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>>
>>>>>> When you say the application errors out, does that mean your mapreduce
>>>>>> job is failing? In that case, apart from hdfs space, you will need to
>>>>>> look at the mapred tmp directory space as well.
>>>>>>
>>>>>> You've got 400GB * 4 * 10 = 16TB of disk; let's assume a replication
>>>>>> factor of 3, so at most you can hold about 5TB of unique data. I am
>>>>>> also assuming you are not scheduling your program to run on the entire
>>>>>> 5TB with just 10 nodes.
>>>>>>
>>>>>> I suspect your cluster's mapred tmp space is getting filled up while
>>>>>> the job is running.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com>wrote:
>>>>>>
>>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3, /data4),
>>>>>>> with each disk having a capacity of 414GB.
>>>>>>>
>>>>>>>
>>>>>>> hdfs-site.xml has the following property set:
>>>>>>>
>>>>>>> <property>
>>>>>>>         <name>dfs.data.dir</name>
>>>>>>>
>>>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>>         <description>Data dirs for DFS.</description>
>>>>>>> </property>
>>>>>>>
>>>>>>> Now we are facing an issue wherein we find /data1 getting filled up
>>>>>>> quickly, and we often see its usage running at 100% with just a few
>>>>>>> megabytes of free space. This issue is visible on 7 out of 10
>>>>>>> datanodes at present.
>>>>>>>
>>>>>>> We have some java applications which write to hdfs, and we often see
>>>>>>> the following errors in our application logs:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I went through some old discussions, and it looks like manual
>>>>>>> rebalancing is what is required in this case; we should also have
>>>>>>> dfs.datanode.du.reserved set up.
>>>>>>>
>>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>>> getting filled up to 100%, can result in the errors which we are
>>>>>>> seeing in our application.
>>>>>>>
>>>>>>> Also, are there any other performance implications due to some of the
>>>>>>> disks running at 100% usage on a datanode?
>>>>>>> --
>>>>>>> Mayank Joshi
>>>>>>>
>>>>>>> Skype: mail2mayank
>>>>>>> Mb.:  +91 8690625808
>>>>>>>
>>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>>
>>>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mayank Joshi
>>>>>
>>>>> Skype: mail2mayank
>>>>> Mb.:  +91 8690625808
>>>>>
>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>
>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>


-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Mayank. Any clue on why only one disk was getting all the writes?

Rahul


On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:

> So we did a manual rebalance (following the instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F),
> reserved 30 GB of space for non-dfs usage via dfs.datanode.du.reserved,
> and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> I have a few points to make; these may not be very helpful for the said
>> problem.
>>
>> + The "All datanodes are bad" exception does not really point to a
>> disk-space-full problem.
>> + hadoop.tmp.dir acts as the base location for other hadoop-related
>> properties; not sure if any particular directory is created specifically.
>> + Only one disk getting filled looks strange. The other disks were part
>> of the setup when the NN was formatted.
>>
>> Would be interesting to know the reason for this.
>> Please keep us posted.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> From the snapshot, you've got around 3TB left for writing data.
>>>
>>> Can you check each individual datanode's storage health?
>>> As you said, you've got 80 servers writing to hdfs in parallel; I am not
>>> sure whether that could be an issue.
>>> As suggested in past threads, you can rebalance the blocks, but that
>>> will take some time to finish and will not solve your issue right away.
>>>
>>> You can wait for others to reply. I am sure there will be far better
>>> solutions from experts for this.
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>>
>>>> No, it's not a map-reduce job. We have a java app running on around 80
>>>> machines which writes to hdfs. The error that I'd mentioned is being
>>>> thrown by the application, and yes, we have the replication factor set
>>>> to 3. The following is the status of hdfs:
>>>>
>>>> Configured Capacity : 16.15 TB
>>>> DFS Used : 11.84 TB
>>>> Non DFS Used : 872.66 GB
>>>> DFS Remaining : 3.46 TB
>>>> DFS Used% : 73.3 %
>>>> DFS Remaining% : 21.42 %
>>>> Live Nodes : 10
>>>> Dead Nodes : 0
>>>> Decommissioning Nodes : 0
>>>> Number of Under-Replicated Blocks : 0
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> When you say the application errors out, does that mean your mapreduce
>>>>> job is failing? In that case, apart from hdfs space, you will need to
>>>>> look at the mapred tmp directory space as well.
>>>>>
>>>>> You've got 400GB * 4 * 10 = 16TB of disk; let's assume a replication
>>>>> factor of 3, so at most you can hold about 5TB of unique data. I am
>>>>> also assuming you are not scheduling your program to run on the entire
>>>>> 5TB with just 10 nodes.
>>>>>
>>>>> I suspect your cluster's mapred tmp space is getting filled up while
>>>>> the job is running.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>>>>
>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>> Each datanode is set up with 4 disks (/data1, /data2, /data3, /data4),
>>>>>> with each disk having a capacity of 414GB.
>>>>>>
>>>>>>
>>>>>> hdfs-site.xml has the following property set:
>>>>>>
>>>>>> <property>
>>>>>>         <name>dfs.data.dir</name>
>>>>>>
>>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>         <description>Data dirs for DFS.</description>
>>>>>> </property>
>>>>>>
>>>>>> Now we are facing an issue wherein we find /data1 getting filled up
>>>>>> quickly, and we often see its usage running at 100% with just a few
>>>>>> megabytes of free space. This issue is visible on 7 out of 10
>>>>>> datanodes at present.
>>>>>>
>>>>>> We have some java applications which write to hdfs, and we often see
>>>>>> the following errors in our application logs:
>>>>>>
>>>>>>
>>>>>>
>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>
>>>>>>
>>>>>>
>>>>>> I went through some old discussions, and it looks like manual
>>>>>> rebalancing is what is required in this case; we should also have
>>>>>> dfs.datanode.du.reserved set up.
>>>>>>
>>>>>> However, I'd like to understand whether this issue, with one disk
>>>>>> getting filled up to 100%, can result in the errors which we are
>>>>>> seeing in our application.
>>>>>>
>>>>>> Also, are there any other performance implications due to some of the
>>>>>> disks running at 100% usage on a datanode?
>>>>>> --
>>>>>> Mayank Joshi
>>>>>>
>>>>>> Skype: mail2mayank
>>>>>> Mb.:  +91 8690625808
>>>>>>
>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>
>>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mayank Joshi
>>>>
>>>> Skype: mail2mayank
>>>> Mb.:  +91 8690625808
>>>>
>>>> Blog: http://www.techynfreesouls.co.nr
>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>
>>>> Today is tommorrow I was so worried about yesterday ...
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
Thanks Mayank, Any clue on why was only one disk was getting all writes.

Rahul


On Thu, Jun 13, 2013 at 11:47 AM, Mayank <ma...@gmail.com> wrote:

> So we did a manual rebalance (followed instructions at:
> http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F)
> and also reserved 30 GB of space for non dfs usage via
> dfs.datanode.du.reserved and restarted our apps.
>
> Things have been going fine till now.
>
> Keeping fingers crossed :)
>
>
> On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
> rahul.rec.dgp@gmail.com> wrote:
>
>> I have a few points to make , these may not be very helpful for the said
>> problem.
>>
>> +All data nodes are bad exception is kind of not pointing to the problem
>> related to disk space full.
>> +hadoop.tmp.dir acts as base location of other hadoop related properties
>> , not sure if any particular directory is created specifically.
>> +Only one disk getting filled looks strange.The other disk are part while
>> formatting the NN.
>>
>> Would be interesting to know the reason for this.
>> Please keep posted.
>>
>> Thanks,
>> Rahul
>>
>>
>> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> From the snapshot, you got around 3TB for writing data.
>>>
>>> Can you check individual datanode's storage health.
>>> As you said you got 80 servers writing parallely to hdfs, I am not sure
>>> can that be an issue.
>>> As suggested in past threads, you can do a rebalance of the blocks but
>>> that will take some time to finish and will not solve your issue right
>>> away.
>>>
>>> You can wait for others to reply. I am sure there will be far better
>>> solutions from experts for this.
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>>
>>>> No it's not a map-reduce job. We've a java app running on around 80
>>>> machines which writes to hdfs. The error that I'd mentioned is being thrown
>>>> by the application and yes we've replication factor set to 3 and following
>>>> is status of hdfs:
>>>>
>>>> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used :872.66 GB DFS
>>>> Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
>>>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
>>>> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
>>>> : 0 Number of Under-Replicated Blocks : 0
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>>
>>>>> when you say application errors out .. does that mean your mapreduce
>>>>> job is erroring? In that case apart from hdfs space you will need to look
>>>>> at mapred tmp directory space as well.
>>>>>
>>>>> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
>>>>> replication factor of 3 so at max you will have datasize of 5TB with you.
>>>>> I am also assuming you are not scheduling your program to run on
>>>>> entire 5TB with just 10 nodes.
>>>>>
>>>>> i suspect your clusters mapred tmp space is getting filled in while
>>>>> the job is running.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>>>>
>>>>>> We are running a hadoop cluster with 10 datanodes and a namenode.
>>>>>> Each datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which
>>>>>> each disk having a capacity 414GB.
>>>>>>
>>>>>>
>>>>>> hdfs-site.xml has following property set:
>>>>>>
>>>>>> <property>
>>>>>>         <name>dfs.data.dir</name>
>>>>>>
>>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>>         <description>Data dirs for DFS.</description>
>>>>>> </property>
>>>>>>
>>>>>> Now we are facing a issue where in we find /data1 getting filled up
>>>>>> quickly and many a times we see it's usage running at 100% with just few
>>>>>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>>>>>> present.
>>>>>>
>>>>>> We've some java applications which are writing to hdfs and many a
>>>>>> times we are seeing foloowing errors in our application logs:
>>>>>>
>>>>>>
>>>>>>
>>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>>
>>>>>>
>>>>>>
>>>>>> I went through some old discussions, and it looks like manual rebalancing
>>>>>> is what is required in this case; we should also have
>>>>>> dfs.datanode.du.reserved set up.
>>>>>>
>>>>>> However, I'd like to understand whether this issue, with one disk getting
>>>>>> filled up to 100%, can result in the errors we are seeing in our
>>>>>> application.
>>>>>>
>>>>>> Also, are there any other performance implications due to some of the
>>>>>> disks running at 100% usage on a datanode?
>>>>>> --
>>>>>> Mayank Joshi
>>>>>>
>>>>>> Skype: mail2mayank
>>>>>> Mb.:  +91 8690625808
>>>>>>
>>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>>
>>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Mayank Joshi
>>>>
>>>> Skype: mail2mayank
>>>> Mb.:  +91 8690625808
>>>>
>>>> Blog: http://www.techynfreesouls.co.nr
>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>
>>>> Today is tommorrow I was so worried about yesterday ...
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
So we did a manual rebalance (following the instructions at:
http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F),
reserved 30 GB of space for non-DFS usage via
dfs.datanode.du.reserved, and restarted our apps.
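
For reference, here is roughly what that entry looks like in hdfs-site.xml
(a sketch; the value is per volume, in bytes, so 30 GB = 32212254720):

```xml
<property>
        <name>dfs.datanode.du.reserved</name>
        <!-- 30 GB expressed in bytes: 30 * 1024^3 -->
        <value>32212254720</value>
        <description>Reserved space in bytes per volume for non-DFS use.</description>
</property>
```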

Things have been going fine till now.

Keeping fingers crossed :)
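
To keep an eye on whether one disk starts filling up again, a quick per-disk
check on each datanode may help (a sketch; the /dataN mount points are the
ones from the original post, so adjust them to your dfs.data.dir entries):

```shell
# Report usage for each configured dfs.data.dir disk on one datanode.
for d in /data1 /data2 /data3 /data4; do
    # Print the usage line for the disk, skipping mounts that do not exist.
    df -h "$d" 2>/dev/null | tail -n 1
done
```

If one disk shows a far higher use% than its peers, a per-disk rebalance is
the usual remedy.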


On Wed, Jun 12, 2013 at 12:58 PM, Rahul Bhattacharjee <
rahul.rec.dgp@gmail.com> wrote:

> I have a few points to make; these may not be very helpful for the said
> problem.
>
> +The "All datanodes are bad" exception does not clearly point to a problem
> related to disk space being full.
> +hadoop.tmp.dir acts as the base location for other Hadoop-related
> properties; not sure if any particular directory is created specifically.
> +Only one disk getting filled looks strange. The other disks were made part
> of the setup when formatting the NN.
>
> It would be interesting to know the reason for this.
> Please keep us posted.
>
> Thanks,
> Rahul
>
>
> On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> From the snapshot, you have around 3 TB available for writing data.
>>
>> Can you check each individual datanode's storage health?
>> As you said, you have 80 servers writing in parallel to HDFS; I am not sure
>> whether that could be an issue.
>> As suggested in past threads, you can do a rebalance of the blocks but
>> that will take some time to finish and will not solve your issue right
>> away.
>>
>> You can wait for others to reply. I am sure there will be far better
>> solutions from experts for this.
>>
>>
>> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>>
>>> No, it's not a map-reduce job. We've a Java app running on around 80
>>> machines which writes to HDFS. The error that I'd mentioned is being thrown
>>> by the application, and yes, we've the replication factor set to 3. The
>>> status of HDFS is as follows:
>>>
>>> Configured Capacity : 16.15 TB
>>> DFS Used : 11.84 TB
>>> Non DFS Used : 872.66 GB
>>> DFS Remaining : 3.46 TB
>>> DFS Used% : 73.3 %
>>> DFS Remaining% : 21.42 %
>>> Live Nodes : 10
>>> Dead Nodes : 0
>>> Decommissioning Nodes : 0
>>> Number of Under-Replicated Blocks : 0
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>>
>>>> when you say the application errors out, does that mean your mapreduce
>>>> job is erroring? In that case, apart from HDFS space, you will need to look
>>>> at the mapred tmp directory space as well.
>>>>
>>>> you have 400 GB * 4 * 10 = 16 TB of disk, and let's assume you have a
>>>> replication factor of 3, so at most you will have a data size of about 5 TB.
>>>> I am also assuming you are not scheduling your program to run on the entire
>>>> 5 TB with just 10 nodes.
>>>>
>>>> I suspect your cluster's mapred tmp space is getting filled while the
>>>> job is running.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>>>
>>>>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>>>>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
>>>>> disk having a capacity of 414 GB.
>>>>>
>>>>>
>>>>> hdfs-site.xml has the following property set:
>>>>>
>>>>> <property>
>>>>>         <name>dfs.data.dir</name>
>>>>>
>>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>>         <description>Data dirs for DFS.</description>
>>>>> </property>
>>>>>
>>>>> Now we are facing an issue wherein /data1 gets filled up quickly, and
>>>>> many times we see its usage running at 100% with just a few megabytes of
>>>>> free space. This issue is visible on 7 out of 10 datanodes at
>>>>> present.
>>>>>
>>>>> We have some Java applications which write to HDFS, and many
>>>>> times we see the following errors in our application logs:
>>>>>
>>>>>
>>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>>
>>>>>
>>>>> I went through some old discussions, and it looks like manual rebalancing
>>>>> is what is required in this case; we should also have
>>>>> dfs.datanode.du.reserved set up.
>>>>>
>>>>> However, I'd like to understand whether this issue, with one disk getting
>>>>> filled up to 100%, can result in the errors we are seeing in our
>>>>> application.
>>>>>
>>>>> Also, are there any other performance implications due to some of the
>>>>> disks running at 100% usage on a datanode?
>>>>> --
>>>>> Mayank Joshi
>>>>>
>>>>> Skype: mail2mayank
>>>>> Mb.:  +91 8690625808
>>>>>
>>>>> Blog: http://www.techynfreesouls.co.nr
>>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>>
>>>>> Today is tommorrow I was so worried about yesterday ...
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>>
>>>
>>> --
>>> Mayank Joshi
>>>
>>> Skype: mail2mayank
>>> Mb.:  +91 8690625808
>>>
>>> Blog: http://www.techynfreesouls.co.nr
>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>
>>> Today is tommorrow I was so worried about yesterday ...
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>


-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
I have a few points to make; these may not be very helpful for the said
problem.

+The "All datanodes are bad" exception does not clearly point to a problem
related to disk space being full.
+hadoop.tmp.dir acts as the base location for other Hadoop-related
properties; not sure if any particular directory is created specifically.
+Only one disk getting filled looks strange. The other disks were made part
of the setup when formatting the NN.

It would be interesting to know the reason for this.
Please keep us posted.

Thanks,
Rahul


On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:

> From the snapshot, you have around 3 TB available for writing data.
>
> Can you check each individual datanode's storage health?
> As you said, you have 80 servers writing in parallel to HDFS; I am not sure
> whether that could be an issue.
> As suggested in past threads, you can do a rebalance of the blocks but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
>> No, it's not a map-reduce job. We've a Java app running on around 80
>> machines which writes to HDFS. The error that I'd mentioned is being thrown
>> by the application, and yes, we've the replication factor set to 3. The
>> status of HDFS is as follows:
>>
>> Configured Capacity : 16.15 TB
>> DFS Used : 11.84 TB
>> Non DFS Used : 872.66 GB
>> DFS Remaining : 3.46 TB
>> DFS Used% : 73.3 %
>> DFS Remaining% : 21.42 %
>> Live Nodes : 10
>> Dead Nodes : 0
>> Decommissioning Nodes : 0
>> Number of Under-Replicated Blocks : 0
>>
>>
>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> when you say the application errors out, does that mean your mapreduce job
>>> is erroring? In that case, apart from HDFS space, you will need to look at
>>> the mapred tmp directory space as well.
>>>
>>> you have 400 GB * 4 * 10 = 16 TB of disk, and let's assume you have a
>>> replication factor of 3, so at most you will have a data size of about 5 TB.
>>> I am also assuming you are not scheduling your program to run on the entire
>>> 5 TB with just 10 nodes.
>>>
>>> I suspect your cluster's mapred tmp space is getting filled while the
>>> job is running.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>>
>>>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>>>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
>>>> disk having a capacity of 414 GB.
>>>>
>>>>
>>>> hdfs-site.xml has the following property set:
>>>>
>>>> <property>
>>>>         <name>dfs.data.dir</name>
>>>>
>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>         <description>Data dirs for DFS.</description>
>>>> </property>
>>>>
>>>> Now we are facing an issue wherein /data1 gets filled up quickly, and
>>>> many times we see its usage running at 100% with just a few megabytes of
>>>> free space. This issue is visible on 7 out of 10 datanodes at
>>>> present.
>>>>
>>>> We have some Java applications which write to HDFS, and many times
>>>> we see the following errors in our application logs:
>>>>
>>>>
>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>
>>>>
>>>> I went through some old discussions, and it looks like manual rebalancing
>>>> is what is required in this case; we should also have
>>>> dfs.datanode.du.reserved set up.
>>>>
>>>> However, I'd like to understand whether this issue, with one disk getting
>>>> filled up to 100%, can result in the errors we are seeing in our
>>>> application.
>>>>
>>>> Also, are there any other performance implications due to some of the
>>>> disks running at 100% usage on a datanode?
>>>> --
>>>> Mayank Joshi
>>>>
>>>> Skype: mail2mayank
>>>> Mb.:  +91 8690625808
>>>>
>>>> Blog: http://www.techynfreesouls.co.nr
>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>
>>>> Today is tommorrow I was so worried about yesterday ...
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>
>
> --
> Nitin Pawar
>

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Rahul Bhattacharjee <ra...@gmail.com>.
I have a few points to make , these may not be very helpful for the said
problem.

+All data nodes are bad exception is kind of not pointing to the problem
related to disk space full.
+hadoop.tmp.dir acts as base location of other hadoop related properties ,
not sure if any particular directory is created specifically.
+Only one disk getting filled looks strange.The other disk are part while
formatting the NN.

Would be interesting to know the reason for this.
Please keep posted.

Thanks,
Rahul


On Mon, Jun 10, 2013 at 3:39 PM, Nitin Pawar <ni...@gmail.com>wrote:

> From the snapshot, you got around 3TB for writing data.
>
> Can you check individual datanode's storage health.
> As you said you got 80 servers writing parallely to hdfs, I am not sure
> can that be an issue.
> As suggested in past threads, you can do a rebalance of the blocks but
> that will take some time to finish and will not solve your issue right
> away.
>
> You can wait for others to reply. I am sure there will be far better
> solutions from experts for this.
>
>
> On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:
>
>> No it's not a map-reduce job. We've a java app running on around 80
>> machines which writes to hdfs. The error that I'd mentioned is being thrown
>> by the application and yes we've replication factor set to 3 and following
>> is status of hdfs:
>>
>> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used : 872.66
>> GB DFS Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
>> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
>> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
>> : 0 Number of Under-Replicated Blocks : 0
>>
>>
>> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>>
>>> When you say the application errors out, does that mean your mapreduce job
>>> is erroring? In that case, apart from HDFS space, you will need to look at
>>> the mapred tmp directory space as well.
>>>
>>> You have 400GB * 4 * 10 = 16TB of disk; let's assume a replication factor
>>> of 3, so at most you can hold a data size of about 5TB. I am also assuming
>>> you are not scheduling your program to run on the entire 5TB with just 10
>>> nodes.
>>>
>>> I suspect your cluster's mapred tmp space is getting filled while the
>>> job is running.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>>
>>>> We are running a Hadoop cluster with 10 datanodes and a namenode. Each
>>>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with each
>>>> disk having a capacity of 414GB.
>>>>
>>>>
>>>> hdfs-site.xml has the following property set:
>>>>
>>>> <property>
>>>>         <name>dfs.data.dir</name>
>>>>
>>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>>         <description>Data dirs for DFS.</description>
>>>> </property>
>>>>
>>>> Now we are facing an issue wherein we find /data1 getting filled up
>>>> quickly, and many times we see its usage running at 100% with just a few
>>>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>>>> present.
>>>>
>>>> We have some Java applications which are writing to HDFS, and many times
>>>> we are seeing the following errors in our application logs:
>>>>
>>>>
>>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>>
>>>>
>>>> I went through some old discussions, and it looks like manual rebalancing
>>>> is what is required in this case; we should also have
>>>> dfs.datanode.du.reserved set up.
>>>>
>>>> However, I'd like to understand if this issue, with one disk getting
>>>> filled up to 100%, can result in the issue which we are seeing in our
>>>> application.
>>>>
>>>> Also, are there any other performance implications due to some of the
>>>> disks running at 100% usage on a datanode?
>>>> --
>>>> Mayank Joshi
>>>>
>>>> Skype: mail2mayank
>>>> Mb.:  +91 8690625808
>>>>
>>>> Blog: http://www.techynfreesouls.co.nr
>>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>>
>>>> Today is tommorrow I was so worried about yesterday ...
>>>>
>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>
>
> --
> Nitin Pawar
>
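[For readers finding this thread later: the dfs.datanode.du.reserved property mentioned above is set in hdfs-site.xml as a per-volume number of bytes that the datanode leaves free for non-DFS use. A minimal sketch follows; the 10 GB value is an illustrative assumption, not a figure from this thread.]

```xml
<!-- hdfs-site.xml: reserve space on each dfs.data.dir volume for non-DFS use.
     10737418240 bytes (10 GB) is an example value; tune it to your cluster. -->
<property>
        <name>dfs.datanode.du.reserved</name>
        <value>10737418240</value>
        <description>Reserved space in bytes per volume. The datanode will
        always leave this much space free for non-DFS use.</description>
</property>
```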

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Nitin Pawar <ni...@gmail.com>.
From the snapshot, you have around 3TB left for writing data.

Can you check each individual datanode's storage health?
As you said, you have 80 servers writing in parallel to HDFS; I am not sure
whether that can be an issue.
As suggested in past threads, you can do a rebalance of the blocks but that
will take some time to finish and will not solve your issue right away.

You can wait for others to reply. I am sure there will be far better
solutions from experts for this.


On Mon, Jun 10, 2013 at 3:18 PM, Mayank <ma...@gmail.com> wrote:

> No it's not a map-reduce job. We've a java app running on around 80
> machines which writes to hdfs. The error that I'd mentioned is being thrown
> by the application and yes we've replication factor set to 3 and following
> is status of hdfs:
>
> Configured Capacity : 16.15 TB DFS Used : 11.84 TB Non DFS Used : 872.66
> GB DFS Remaining : 3.46 TB DFS Used% : 73.3 % DFS Remaining% : 21.42 % Live
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> :10 Dead
> Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD>
> : 0  Decommissioning Nodes<http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING>
> : 0 Number of Under-Replicated Blocks : 0
>
>
> On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:
>
>> when you say application errors out .. does that mean your mapreduce job
>> is erroring? In that case apart from hdfs space you will need to look at
>> mapred tmp directory space as well.
>>
>> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
>> replication factor of 3 so at max you will have datasize of 5TB with you.
>> I am also assuming you are not scheduling your program to run on entire
>> 5TB with just 10 nodes.
>>
>> i suspect your clusters mapred tmp space is getting filled in while the
>> job is running.
>>
>>
>>
>>
>>
>> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>>
>>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>>> datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which each
>>> disk having a capacity 414GB.
>>>
>>>
>>> hdfs-site.xml has following property set:
>>>
>>> <property>
>>>         <name>dfs.data.dir</name>
>>>
>>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>>         <description>Data dirs for DFS.</description>
>>> </property>
>>>
>>> Now we are facing a issue where in we find /data1 getting filled up
>>> quickly and many a times we see it's usage running at 100% with just few
>>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>>> present.
>>>
>>> We've some java applications which are writing to hdfs and many a times
>>> we are seeing foloowing errors in our application logs:
>>>
>>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>>
>>>
>>> I went through some old discussions and looks like manual rebalancing is
>>> what is required in this case and we should also have
>>> dfs.datanode.du.reserved set up.
>>>
>>> However I'd like to understand if this issue, with one disk getting
>>> filled up to 100% can result into the issue which we are seeing in our
>>> application.
>>>
>>> Also, are there any other peformance implications due to some of the
>>> disks running at 100% usage on a datanode.
>>> --
>>> Mayank Joshi
>>>
>>> Skype: mail2mayank
>>> Mb.:  +91 8690625808
>>>
>>> Blog: http://www.techynfreesouls.co.nr
>>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>>
>>> Today is tommorrow I was so worried about yesterday ...
>>>
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
>
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tommorrow I was so worried about yesterday ...
>



-- 
Nitin Pawar
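[A quick sanity check on the numbers quoted in this thread — 414GB per disk, 4 disks per node, 10 nodes, replication factor 3, all taken from the posts above. This is just back-of-the-envelope arithmetic, not an HDFS API:]

```python
# Back-of-the-envelope capacity math using the figures from this thread.
disks_per_node = 4
disk_gb = 414        # per-disk capacity, from the original post
nodes = 10
replication = 3      # replication factor stated by Mayank

raw_tb = disks_per_node * disk_gb * nodes / 1024.0  # total raw DFS capacity
usable_tb = raw_tb / replication                    # max unique data at rep=3

print(round(raw_tb, 2))     # close to the 16.15 TB "Configured Capacity"
print(round(usable_tb, 2))  # roughly the ~5TB Nitin estimated
```

[The small gap between 16.17 TB here and the reported 16.15 TB configured capacity is plausibly filesystem overhead on each volume.]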

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, it's not a map-reduce job. We have a Java app running on around 80
machines which writes to HDFS. The error that I'd mentioned is being thrown
by the application, and yes, we have the replication factor set to 3. Following
is the status of HDFS:

Configured Capacity : 16.15 TB
DFS Used : 11.84 TB
Non DFS Used : 872.66 GB
DFS Remaining : 3.46 TB
DFS Used% : 73.3 %
DFS Remaining% : 21.42 %
Live Nodes : 10
Dead Nodes : 0
Decommissioning Nodes : 0
Number of Under-Replicated Blocks : 0


On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:

> when you say application errors out .. does that mean your mapreduce job
> is erroring? In that case apart from hdfs space you will need to look at
> mapred tmp directory space as well.
>
> you got 400GB * 4 * 10 = 16TB of disk and lets assume that you have a
> replication factor of 3 so at max you will have datasize of 5TB with you.
> I am also assuming you are not scheduling your program to run on entire
> 5TB with just 10 nodes.
>
> i suspect your clusters mapred tmp space is getting filled in while the
> job is running.
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>> datanode is setup with 4 disks (/data1, /data2, /data3, /data4), which each
>> disk having a capacity 414GB.
>>
>>
>> hdfs-site.xml has following property set:
>>
>> <property>
>>         <name>dfs.data.dir</name>
>>
>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>         <description>Data dirs for DFS.</description>
>> </property>
>>
>> Now we are facing a issue where in we find /data1 getting filled up
>> quickly and many a times we see it's usage running at 100% with just few
>> megabytes of free space. This issue is visible on 7 out of 10 datanodes at
>> present.
>>
>> We've some java applications which are writing to hdfs and many a times
>> we are seeing foloowing errors in our application logs:
>>
>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>
>>
>> I went through some old discussions and looks like manual rebalancing is
>> what is required in this case and we should also have
>> dfs.datanode.du.reserved set up.
>>
>> However I'd like to understand if this issue, with one disk getting
>> filled up to 100% can result into the issue which we are seeing in our
>> application.
>>
>> Also, are there any other peformance implications due to some of the
>> disks running at 100% usage on a datanode.
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tommorrow I was so worried about yesterday ...
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tommorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Mayank <ma...@gmail.com>.
No, it's not a map-reduce job. We have a Java app running on around 80
machines which writes to HDFS. The error I mentioned is being thrown by the
application. And yes, we have the replication factor set to 3; the current
status of HDFS is:

Configured Capacity : 16.15 TB
DFS Used : 11.84 TB
Non DFS Used : 872.66 GB
DFS Remaining : 3.46 TB
DFS Used% : 73.3 %
DFS Remaining% : 21.42 %
Live Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=LIVE> : 10
Dead Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DEAD> : 0
Decommissioning Nodes <http://hmaster.production.indix.tv:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING> : 0
Number of Under-Replicated Blocks : 0
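
For reference, the dfs.datanode.du.reserved setting mentioned in the
original post goes in hdfs-site.xml on each datanode. A minimal sketch;
the value is in bytes per volume, and the 10 GB figure below is chosen
purely for illustration, not as a recommendation:

```xml
<!-- hdfs-site.xml: reserve non-DFS headroom on each volume so block
     writes can never drive a data directory to 100% usage.
     The 10 GB value below is illustrative only. -->
<property>
        <name>dfs.datanode.du.reserved</name>
        <value>10737418240</value>
        <description>Reserved space in bytes per volume (here 10 GB).</description>
</property>
```

The datanode must be restarted for the new reservation to take effect.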


On Mon, Jun 10, 2013 at 3:11 PM, Nitin Pawar <ni...@gmail.com>wrote:

> When you say the application errors out, does that mean your MapReduce
> job is erroring? In that case, apart from HDFS space you will need to
> look at the mapred tmp directory space as well.
>
> You've got 400 GB * 4 * 10 = 16 TB of disk, and let's assume you have a
> replication factor of 3, so at most you will have a data size of about
> 5 TB. I am also assuming you are not scheduling your program to run on
> the entire 5 TB with just 10 nodes.
>
> I suspect your cluster's mapred tmp space is getting filled up while the
> job is running.
>
>
>
>
>
> On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:
>
>> We are running a hadoop cluster with 10 datanodes and a namenode. Each
>> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with
>> each disk having a capacity of 414 GB.
>>
>>
>> hdfs-site.xml has following property set:
>>
>> <property>
>>         <name>dfs.data.dir</name>
>>
>> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>>         <description>Data dirs for DFS.</description>
>> </property>
>>
>> Now we are facing an issue wherein we find /data1 filling up quickly,
>> and we often see its usage running at 100% with just a few megabytes of
>> free space. This issue is visible on 7 out of 10 datanodes at present.
>>
>> We have some Java applications which write to HDFS, and we often see
>> the following errors in our application logs:
>>
>> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
>> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>>
>>
>> I went through some old discussions, and it looks like manual
>> rebalancing is what is required in this case; we should also have
>> dfs.datanode.du.reserved set up.
>>
>> However, I'd like to understand whether one disk getting filled up to
>> 100% can result in the errors we are seeing in our application.
>>
>> Also, are there any other performance implications when some of the
>> disks on a datanode are running at 100% usage?
>> --
>> Mayank Joshi
>>
>> Skype: mail2mayank
>> Mb.:  +91 8690625808
>>
>> Blog: http://www.techynfreesouls.co.nr
>> PhotoStream: http://picasaweb.google.com/mail2mayank
>>
>> Today is tomorrow I was so worried about yesterday ...
>>
>
>
>
> --
> Nitin Pawar
>



-- 
Mayank Joshi

Skype: mail2mayank
Mb.:  +91 8690625808

Blog: http://www.techynfreesouls.co.nr
PhotoStream: http://picasaweb.google.com/mail2mayank

Today is tomorrow I was so worried about yesterday ...

Re: Application errors with one disk on datanode getting filled up to 100%

Posted by Nitin Pawar <ni...@gmail.com>.
When you say the application errors out, does that mean your MapReduce job
is erroring? In that case, apart from HDFS space you will need to look at
the mapred tmp directory space as well.

You've got 400 GB * 4 * 10 = 16 TB of disk, and let's assume you have a
replication factor of 3, so at most you will have a data size of about 5 TB.
I am also assuming you are not scheduling your program to run on the entire
5 TB with just 10 nodes.
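
The capacity estimate above can be sketched in a few lines (assuming the
round 400 GB/disk figure used in this thread; the actual disks are 414 GB):

```python
# Back-of-the-envelope HDFS capacity math for the cluster in this thread.
# Assumed figures: 400 GB per disk (rounded), 4 disks per datanode,
# 10 datanodes, replication factor 3.
DISK_GB = 400
DISKS_PER_NODE = 4
NODES = 10
REPLICATION = 3

raw_gb = DISK_GB * DISKS_PER_NODE * NODES  # total raw capacity across the cluster
max_data_gb = raw_gb / REPLICATION         # max logical data size at replication 3

print(raw_gb)              # 16000, i.e. ~16 TB raw
print(round(max_data_gb))  # 5333, i.e. roughly 5 TB of logical data
```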

I suspect your cluster's mapred tmp space is getting filled up while the
job is running.





On Mon, Jun 10, 2013 at 3:06 PM, Mayank <ma...@gmail.com> wrote:

> We are running a hadoop cluster with 10 datanodes and a namenode. Each
> datanode is set up with 4 disks (/data1, /data2, /data3, /data4), with
> each disk having a capacity of 414 GB.
>
>
> hdfs-site.xml has following property set:
>
> <property>
>         <name>dfs.data.dir</name>
>
> <value>/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs,/data4/hadoopfs</value>
>         <description>Data dirs for DFS.</description>
> </property>
>
> Now we are facing an issue wherein we find /data1 filling up quickly,
> and we often see its usage running at 100% with just a few megabytes of
> free space. This issue is visible on 7 out of 10 datanodes at present.
>
> We have some Java applications which write to HDFS, and we often see the
> following errors in our application logs:
>
> java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3093)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2200(DFSClient.java:2586)
> 	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2790)
>
>
> I went through some old discussions, and it looks like manual rebalancing
> is what is required in this case; we should also have
> dfs.datanode.du.reserved set up.
>
> However, I'd like to understand whether one disk getting filled up to
> 100% can result in the errors we are seeing in our application.
>
> Also, are there any other performance implications when some of the disks
> on a datanode are running at 100% usage?
> --
> Mayank Joshi
>
> Skype: mail2mayank
> Mb.:  +91 8690625808
>
> Blog: http://www.techynfreesouls.co.nr
> PhotoStream: http://picasaweb.google.com/mail2mayank
>
> Today is tomorrow I was so worried about yesterday ...
>



-- 
Nitin Pawar
