Posted to common-dev@hadoop.apache.org by "Johan Oskarsson (JIRA)" <ji...@apache.org> on 2008/05/01 16:34:55 UTC

[jira] Updated: (HADOOP-3232) Datanodes time out

     [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Johan Oskarsson updated HADOOP-3232:
------------------------------------

    Attachment: du-nonblocking-v1.patch

I created a simple patch that makes DU run the shell command in a new thread, so getUsage() never blocks. This changes the behavior slightly, but in most cases that shouldn't be a problem.
I haven't had a chance to test this on a real cluster yet, but it works in my local testing.
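
For anyone who wants the idea without opening the attachment, here is a rough sketch of the general approach. It is not the code in du-nonblocking-v1.patch. The class name CachedUsage, the refreshOnce() method and the LongSupplier standing in for the du shell call are all made up for illustration, and treating the behavior change as "callers may see a slightly stale value" is an assumption on my part.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical sketch only; names and structure are assumptions, not the attached patch.
class CachedUsage {
    private final LongSupplier measure;       // stand-in for the slow "du" shell call
    private final long refreshIntervalMs;     // pause between two measurements
    private final AtomicLong lastUsage = new AtomicLong(0L);
    private volatile boolean running;
    private Thread refresher;

    CachedUsage(LongSupplier measure, long refreshIntervalMs) {
        this.measure = measure;
        this.refreshIntervalMs = refreshIntervalMs;
    }

    // Never blocks: returns the last measured value (0 until the first refresh finishes).
    long getUsage() {
        return lastUsage.get();
    }

    // Performs one (possibly slow) measurement and publishes the result.
    void refreshOnce() {
        try {
            lastUsage.set(measure.getAsLong());
        } catch (RuntimeException e) {
            // Keep the previous value if a single measurement fails.
        }
    }

    // Starts the background thread that keeps the cached value fresh.
    synchronized void start() {
        if (refresher != null) {
            return; // already started
        }
        running = true;
        refresher = new Thread(() -> {
            while (running) {
                refreshOnce();
                try {
                    Thread.sleep(refreshIntervalMs);
                } catch (InterruptedException e) {
                    return; // shutdown() was called
                }
            }
        }, "usage-refresher");
        refresher.setDaemon(true);
        refresher.start();
    }

    // Stops the background thread; getUsage() keeps returning the last value.
    synchronized void shutdown() {
        running = false;
        if (refresher != null) {
            refresher.interrupt();
        }
    }
}
```

A caller would construct it with whatever actually measures disk usage, call start() once, and read getUsage() from the heartbeat path. The trade-off in this sketch is that a reader can get a value that is up to one refresh interval (plus the duration of one measurement) out of date.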

Unfortunately, it turns out this only solves part of the problem: block reports are still an issue.
That one is harder to solve, and I'm not sure how to restructure it. Ideas?

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: du-nonblocking-v1.patch, hadoop-hadoop-datanode-new.log, hadoop-hadoop-datanode-new.out, hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions we often saw in the nn webui that the "last contact" for one or two datanodes went from the usual 0-3 sec to ~200-300 sec before dropping back to 0.
> This causes mild discomfort but the big problems appear when all nodes do this at once, as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but looking at the gc log output it doesn't seem to be the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.