Posted to common-user@hadoop.apache.org by Abhijit Bagri <ab...@yahoo-inc.com> on 2008/11/15 15:12:59 UTC

The Case of a Long Running Hadoop System

Hi,

This is a long mail, as I have tried to include as much detail as might  
help any of the Hadoop devs/users to help us out. The gist is this:

We have a long-running Hadoop system (masters not restarted for about  
3 months). We have recently started seeing the DFS respond very  
slowly, which has resulted in failures in a system that depends on  
Hadoop. Further, the DFS seems to be in an unstable state (if fsck is  
a good indicator, which I believe it is), and the edits file has grown very large (details below).

These are the details (for a quicker read, skip to the questions at the  
end of the mail and return here later):

Hadoop Version: 0.15.3 on 32 bit systems.

Number of slaves: 12
Slaves heap size: 1G
Namenode heap: 2G
Jobtracker heap: 2G

The namenode and jobtracker have not been restarted for about 3  
months. We did restart the slaves (all of them within a few hours) a few  
times for some maintenance in between, though. We do not have a  
secondary namenode in place.

There is another system X which talks to this Hadoop cluster. X writes  
to the Hadoop DFS and submits jobs to the JobTracker. The number of  
jobs submitted to Hadoop so far is over 650,000 (I am inferring this  
from the job IDs); each job may read/write multiple files and  
has several dependent libraries which it loads from the DistributedCache.
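
For context, X uses the standard client APIs for all of this; roughly something
like the following (a simplified sketch, with a made-up job name and jar path,
and with the real mapper/input/output setup left out):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitFromX {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitFromX.class);
    conf.setJobName("x-job");  // hypothetical job name

    // Ship a dependent library to the tasks via the DistributedCache
    // (the jar path is made up; X stages several such libraries per job).
    DistributedCache.addCacheFile(new URI("/x/libs/dependency.jar"), conf);

    // ... mapper/reducer classes and input/output paths are set here as usual ...

    JobClient.runJob(conf);  // blocks until the job finishes
  }
}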

Recently, we started seeing several timeouts while X tries to  
read/write to the DFS. This in turn results in the DFS becoming very  
slow to respond. The writes are especially slow. The trace we get in  
the logs is:

java.net.SocketTimeoutException: Read timed out
         at java.net.SocketInputStream.socketRead0(Native Method)
         at java.net.SocketInputStream.read(SocketInputStream.java:129)
         at java.net.SocketInputStream.read(SocketInputStream.java:182)
         at java.io.DataInputStream.readShort(DataInputStream.java:284)
         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
         at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
         ...
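
For what it's worth, there is nothing exotic on X's side; the writes go through
the plain FileSystem API, and the timeout fires inside close(), roughly as in
this sketch (the path and data are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToDfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up our cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);       // the DFS configured as the default filesystem

    FSDataOutputStream out = fs.create(new Path("/x/output/some-file"));
    out.write("some data\n".getBytes());
    out.close();   // this is where the SocketTimeoutException above is thrown
  }
}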

Also, datanode logs show a lot of traces like these:

2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode: DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is valid, and cannot be written to.
         at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
         at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
         at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
         at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
         at java.lang.Thread.run(Thread.java:595)

and these

2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Error in deleting blocks.
         at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:719)
         at org.apache.hadoop.

The first one seems to be the same as HADOOP-1845. We have been seeing  
it for a long time, but as HADOOP-1845 says, it hasn't been causing  
any problems, so we have ignored it for a while. These recent  
problems, however, have raised eyebrows again as to whether it really is a non-issue.

We also saw a spike in the number of TCP connections on system X.  
An lsof on system X also shows a lot of connections to datanodes  
in the CLOSE_WAIT state.

While investigating the issue, the fsck output started reporting that the DFS  
is corrupt. The output is:

Total size:    152873519485 B
Total blocks:  1963655 (avg. block size 77851 B)
Total dirs:    1390820
Total files:   1962680
Over-replicated blocks:        111 (0.0061619785 %)
Under-replicated blocks:       13 (6.6203077E-4 %)
Target replication factor:     3
Real replication factor:       3.0000503


The filesystem under path '/' is CORRUPT

After running fsck -move, the system went back to a healthy state.  
However, after some time it again reported corruption, and fsck  
-move again restored it to a healthy state.

We are considering restarting the masters as a possible solution. I  
have earlier had issues restarting the master with a 2G heap and an  
edits file of about 2.5G. So we decided to look up the size of the  
edits file: it is now 8.5G! Since we are on a 32-bit  
system, we cannot push the JVM heap beyond about 2.5G, so I am not sure  
that if we bring down the master we will be able to restart it at  
all. Please note that we do not have a secondary namenode.


Questions for hadoop-users and hadoop-dev:

About Hadoop restart:

1. Does the namenode try to load the edits file into memory on startup?
2. Will we be able to restart the namenode without a format at all?
3. Apart from taking a backup, is there a safe way to restart the system  
and retain the DFS data?

About DFS being slow:

4. Is HADOOP-1845 a non-issue? If it is, what may cause the DFS to  
become slow and lead to SocketTimeoutException on the client?
5. What may cause a spike in TCP connections and many connection  
fds left open in the CLOSE_WAIT state? Is HADOOP-3071 relevant?
6. Is a restart of the master a solution at all?

About the health of the DFS:

7. What are the implications of the DFS going into the CORRUPT state every  
once in a while? Should this increase my caffeine intake? More  
importantly, is the Hadoop system on the verge of collapse?
8. Would restarting all slaves within a few hours, without restarting  
the masters, have any implications for DFS health? Note that this was  
done to avoid any downtime for system X.

Our target is to return to a state where the DFS functions normally  
(as it was doing until some time back) and, of course, to maintain a  
healthy DFS with its current data. Please provide any insights.

To add to the above,

- Because of the way X depends on this Hadoop system (API, DFS  
data, and Hadoop's internal data structures), it is non-trivial for us  
to jump to a higher version of Hadoop in the very near future. We could  
apply individual patches that work with 0.15.3.
- Loss of DFS data would be fatal. I believe that once I format, I will get  
a namespace ID mismatch unless I clear the Hadoop data directories  
from the slaves. Is this correct?
- We did not start a secondary namenode due to HADOOP-3069. I have a  
locally patched 0.15.3 with this JIRA applied. Will it help, and is it safe,  
to add the secondary namenode now?

We have tried to look through JIRA for more leads, but are not sure which  
issues fit our case specifically. Pointers to JIRA issues are also  
welcome. Of course, please feel free to address any part of the  
problem :)

Thanks
Bagri












Re: Practical limits on number of blocks per datanode.

Posted by Johan Oskarsson <jo...@oskarsson.nu>.
Hi Rick,

unfortunately 4,800,000 blocks per node is going to be too much. Ideally
you'd want to merge your files into as few as possible; even 1MB per
file is quite small for Hadoop. Would it be possible to merge them into
hundreds of MBs, or preferably gigabyte-sized, files?
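
Something like this is what I mean by packing: append each small file as one
record of a SequenceFile, keyed by its name, and run your jobs over the packed
files instead. A rough, untested sketch (the paths and the Text/BytesWritable
choice are just an example):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/data/packed/part-00000"), Text.class, BytesWritable.class);

    // One record per small file: key = original file name, value = file contents.
    for (File f : new File("/local/smallfiles").listFiles()) {
      byte[] contents = new byte[(int) f.length()];
      DataInputStream in = new DataInputStream(new FileInputStream(f));
      in.readFully(contents);
      in.close();
      writer.append(new Text(f.getName()), new BytesWritable(contents));
    }
    writer.close();
  }
}

With, say, 1000 of your 10K files per SequenceFile you already get ~10MB files,
and you can go much bigger from there.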

In newer Hadoop versions there is an archive feature that can put many
files into an archive for you. This can then be processed transparently
by Hadoop. I haven't used it, though, so I can't tell whether it's worth the
effort.

I have run into issues with too many blocks per datanode before, and it's not
fun: the datanodes start losing contact with the namenode, with all kinds of
interesting side effects.

/Johan



Practical limits on number of blocks per datanode.

Posted by Rick Hangartner <ha...@strands.com>.
Hi,

We are in the midst of considering Hadoop as a prototype solution for  
a system we are building.  In the abstract Hadoop and MapReduce are  
very well-suited to our computational problem.  However, this email  
exchange has caused us some concern that we are hoping the user  
community might allay.  We've searched JIRA for relevant issues but  
didn't turn up anything. (We probably aren't as adept as we might be  
at surfacing appropriate items though.)

Here are the relevant numbers for the data we are using to prototype a  
system using Hadoop 0.18.1:

We have 16,000,000 files that are 10K each, or about 160GB total.  We  
have 10 datanodes with the default replication factor of 3.  Each  
file will probably be stored as a single block, right?  This means we  
would be storing 48,000,000 block replicas on 10 datanodes, or 4,800,000 blocks  
per node.

At 160GB, the total data is not particularly large.  Unfortunately,  
the email exchange in this thread suggests we could have a problem with the  
large number of blocks per node.  We have considered combining a  
number of small files into larger files (say, concatenating sets of  
100 files into single larger files, so we would have 48,000 blocks of  
1MB in size per node).  This would not significantly affect our  
MapReduce algorithm, but it could undesirably complicate other  
components of the system that use this data.
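
(For completeness, here is the back-of-the-envelope arithmetic behind those
two figures, in case we have made an error somewhere:)

public class BlockCountCheck {
  public static void main(String[] args) {
    long files = 16000000L;      // 16,000,000 files of ~10KB each, one block per file
    int replication = 3;
    int datanodes = 10;

    long replicasPerNode = files * replication / datanodes;
    System.out.println(replicasPerNode);   // 4800000 block replicas per node

    long mergedFiles = files / 100;        // concatenate sets of 100 files into ~1MB files
    System.out.println(mergedFiles * replication / datanodes);   // 48000 blocks per node
  }
}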

Thanks in advance for any insights on the match between Hadoop (0.18.x  
and later) and our particular system requirements.

RDH


Re: The Case of a Long Running Hadoop System

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
Bagri,

According to the numbers you posted, your cluster has about 6,000,000 block replicas
and only 12 data-nodes. The blocks are small, on average about 78KB according
to fsck, so each node contains about 40GB worth of block data.
But the number of blocks is really huge: about 500,000 per node. Is my math correct?
I haven't seen data-nodes that big yet.
The problem here is that a data-node keeps a map of all its blocks in memory.
The map is a HashMap, and with 500,000 entries you can get long lookup times, I guess.
Block reports can also take a long time.
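
(The arithmetic, using the fsck numbers you posted, in case anyone wants to check it:)

public class DataNodeLoadCheck {
  public static void main(String[] args) {
    long totalBytes = 152873519485L;   // fsck: Total size
    long blocks     = 1963655L;        // fsck: Total blocks
    int replication = 3;               // fsck: Target replication factor
    int datanodes   = 12;

    long replicasPerNode = blocks * replication / datanodes;   // ~491,000 block replicas per node
    long avgBlockSize    = totalBytes / blocks;                // ~78KB average block size
    long bytesPerNode    = replicasPerNode * avgBlockSize;     // ~38GB of block data per node

    System.out.println(replicasPerNode + " replicas/node, ~"
        + bytesPerNode / (1000L * 1000 * 1000) + " GB/node");
  }
}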

So I believe restarting the name-node will not help you.
You should somehow pack your small files into larger ones.
Alternatively, you can increase your cluster size, probably to 5 to 10 times larger.
I don't remember whether we have had any optimization patches related to the data-node
block map since 0.15. Please advise if anybody remembers.

Thanks,
--Konstantin



Re: The Case of a Long Running Hadoop System

Posted by Abhijit Bagri <ab...@yahoo-inc.com>.
We do not have a secondary namenode because 0.15.3 has a serious bug  
which truncates the namenode image if there is a failure while the  
namenode fetches the image from the secondary namenode. See HADOOP-3069.

I have a patched version of 0.15.3 for this issue. From the patch for  
HADOOP-3069, the changes are on the namenode _and_ the secondary namenode,  
which means I can't just fire up a secondary namenode.

- Bagri




Re: The Case of a Long Running Hadoop System

Posted by Billy Pearson <sa...@pearsonwholesale.com>.
If I understand correctly, the secondary namenode merges the edits log into the 
fsimage and reduces the edit log size.
That is likely the root of your problems: 8.5G seems large and is likely 
putting a strain on your master server's memory and IO bandwidth.
Why do you not have a secondary namenode?

If you do not have the memory on the master, I would look into stopping a 
datanode/tasktracker on one server and loading the secondary namenode on it.

Let it run for a while and watch the secondary namenode's log; you should 
see your edit log get smaller.

I am not an expert, but that would be my first action.

Billy



"Abhijit Bagri" <ab...@yahoo-inc.com> wrote in 
message news:4C1984EA-74B3-40D1-B9BE-AB2442537217@yahoo-inc.com...
> Hi,
>
> This is a long mail as I have tried to put in as much details as might
> help any of the Hadoop dev/users to help us out. The gist is this:
>
> We have a long running Hadoop system (masters not restarted for about
> 3 months). We have recently started seeing the DFS responding very
> slowly which has resulted in failures on a system which depends on
> Hadoop. Further, the DFS seems to be an unstable state (i.e if fsck is
> a good representation which I believe it is). The edits file
>
> These are the details (skip/return here later and jump to the
> questions at the end of the mail for a quicker read) :
>
> Hadoop Version: 0.15.3 on 32 bit systems.
>
> Number of slaves: 12
> Slaves heap size: 1G
> Namenode heap: 2G
> Jobtracker heap: 2G
>
> The namenode and jobtrackers have not been restarted for about 3
> months. We did restart slaves(all of them within a few hours) a few
> times for some maintaineance in between though. We do not have a
> secondary namenode in place.
>
> There is another system X which talks to this hadoop cluster. X writes
> to the Hadoop DFS and submits jobs to the Jobtracker. The number of
> jobs submitted to Hadoop so far is over 650,000 ( I am using the job
> id for jobs for this), each job may rad/write to multiple files and
> has several dependent libraries which it loads from Distributed Cache.
>
> Recently, we started seeing that there were several timeouts happening
> while X tries to read/write to the DFS. This in turn results in DFS
> becoming very slow in response. The writes are especially slow. The
> trace we get in the logs are:
>
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:129)
>         at java.net.SocketInputStream.read(SocketInputStream.java:182)
>         at java.io.DataInputStream.readShort(DataInputStream.java:284)
>         at org.apache.hadoop.dfs.DFSClient
> $DFSOutputStream.endBlock(DFSClient.java:1660)
>         at org.apache.hadoop.dfs.DFSClient
> $DFSOutputStream.close(DFSClient.java:1733)
>         at org.apache.hadoop.fs.FSDataOutputStream
> $PositionCache.close(FSDataOutputStream.java:49)
>         at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:
> 64)
>         ...
>
> Also, datanode logs show a lot of traces like these:
>
> 2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode:
> DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is
> valid, and cannot be written to.
>         at
> org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
>         at org.apache.hadoop.dfs.DataNode
> $BlockReceiver.<init>(DataNode.java:1257)
>         at org.apache.hadoop.dfs.DataNode
> $DataXceiver.writeBlock(DataNode.java:901)
>         at org.apache.hadoop.dfs.DataNode
> $DataXceiver.run(DataNode.java:804)
>         at java.lang.Thread.run(Thread.java:595)
>
> and these
>
> 2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode:
> java.io.IOException: Error in deleting blocks.
>         at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:
> 719)
>         at org.apache.hadoop.
>
> The first one seems to be same as HADOOP-1845. We have been seeing
> this for a long time, but as HADOOP-1845 says, it hasn't been creating
> any problems. We have thus ignored this for a while, but this recent
> problems have raised eyebrows again if this really is a non-issue.
>
> We also saw a spike in number of TCP connections on the system X.
> Also, a lsof on the system X shows a lot of connections to datanodes
> in the CLOSE_WAIT state.
>
> While investigating the issue, the fsck output started saying that DFS
> is corrupt. The output is:
>
> Total size:    152873519485 B
> Total blocks:  1963655 (avg. block size 77851 B)
> Total dirs:    1390820
> Total files:   1962680
> Over-replicated blocks:        111 (0.0061619785 %)
> Under-replicated blocks:       13 (6.6203077E-4 %)
> Target replication factor:     3
> Real replication factor:       3.0000503
>
>
> The filesystem under path '/' is CORRUPT
>
> After running a fsck -move, the system went back to healthy state.
> However, after some time it again started showing corrupt and fsk -
> move restored it to a healthy state.
>
> We are considering restarting the masters as a possible solution. I
> have earlier had issue with restarting Master with 2G heap and edits
> file size of about 2.5G. So, we decided to lookup the size of the
> edits file. The edits file is now 8.5G! Since, we are on a 32 bit
> system, we cannot push JVM heap beyond about 2.5G. So, I am not sure
> that if we bring down the master, we will be able to restart it at
> all. Please note that we do not have a secondary namenode.
>
>
> Questions for hadoop-users and hadoop-dev?
>
> About Hadoop restart:
>
> 1. Does namenode try to load edits file into memory on starrtup?
> 2. Will we be able to restart the namenode without a format at all?
> 3. Apart from taking backup, is there a safe way to restart the system
> and retain the DFS data?
>
> About DFS being slow:
>
> 4. Is HADOOP-1845 a non-issue? If it is, what may cause the DFS to
> become slow and lead to SocketTimeoutException on client.
> 5. What may cause a spike in TCP connections and several connection
> fds open in CLOSE_WAIT state? Is HADOOP-3071 relevant?
> 6. Is restart of master a solution at all?
>
> About the health of the DFS:
>
> 7. What are the implications of DFS going into CORRUPT state every
> once in a while? Should this increase my caffeine intake? More
> importantly, is the Hadoop system on verge of a collapse?
> 8. Would restarting all slaves within a few hours without restarting
> the masters have an implication on DFS health? Note that, this was
> done to prevent any downtime for system X.
>
> Our target is to return to a state where the DFS functions normally
> (like it was doing till some time back) and of course to maintain a
> healthy DFS with its current data. Please provide any insights
>
> To add to the above,
>
> - Due to the kind of dependency of X on this Hadoop system, api, DFS
> data, and internal data structure of Hadoop,  it is non trivial for us
> to jump to higher versions of hadoop in very near future. We could
> apply some patches which could go with 0.15.3
> - Loss of DFS data will be fatal. I believe once I format, I will get
> a namespace id mismatch unless I clear the hadoop data directories
> from the slaves. is this correct?
> - We did not start secondary namenode due to HADOOP-3069. I have a
> locally patched 0.15.3 with this JIRA, Will it be help and is it safe
> if we add the secondary namenode now?
>
> We have tried to look into jira for more leads, but are not sure which
> ones fit our case specifically. Pointers to JIRA issues are also
> welcome. Of course, please feel free to address any part of the
> problem :)
>
> Thanks
> Bagri
>
>
>
>
>
>
>
>
>
>
>
>