Posted to notifications@accumulo.apache.org by "Bill Havanki (JIRA)" <ji...@apache.org> on 2014/01/22 16:43:24 UTC

[jira] [Comment Edited] (ACCUMULO-2227) Concurrent randomwalk fails when namenode dies after bulk import step

    [ https://issues.apache.org/jira/browse/ACCUMULO-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13878758#comment-13878758 ] 

Bill Havanki edited comment on ACCUMULO-2227 at 1/22/14 3:42 PM:
-----------------------------------------------------------------

The failure here is ultimately due to running the test under Hadoop 2.0.0-cdh4.5.0. Up until Hadoop 2.1.0, client-to-namenode calls like {{delete}}, annotated as {{AtMostOnce}} in {{org.apache.hadoop.hdfs.protocol.ClientProtocol}}, were not retried; only operations marked {{Idempotent}} were. Starting with the implementation of HADOOP-9792 in Hadoop 2.1.0, {{AtMostOnce}}-annotated operations are also retried. So, I expect that upgrading my cluster to Hadoop 2.1.0 or higher, or to a CDH release that includes a backport of HADOOP-9792, would resolve this issue.

The {{mkdirs}} call is annotated as {{Idempotent}}, so it should not cause this problem, even under Hadoop 2.0.0.
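
For illustration, the relevant annotations in {{ClientProtocol}} look roughly like this (paraphrased from memory rather than copied from the Hadoop source; signatures are approximate):

{noformat}
// Paraphrased sketch of the annotations on org.apache.hadoop.hdfs.protocol.ClientProtocol
// (approximate, not the exact Hadoop source). Before HADOOP-9792, the client retry
// policy only replayed methods tagged @Idempotent.
@AtMostOnce   // not safe to blindly replay, so pre-2.1.0 clients give up after a failure
public boolean delete(String src, boolean recursive) throws IOException;

@Idempotent   // safe to replay, so it is retried even on Hadoop 2.0.x
public boolean mkdirs(String src, FsPermission masked, boolean createParent)
    throws IOException;
{noformat}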

I'm not sure that adding an _ad hoc_ retry here is the best way to resolve this, so any opinions are welcome.
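
For concreteness, here is roughly what such an _ad hoc_ retry around the cleanup could look like (illustrative sketch only; the helper name, retry count, and sleep are arbitrary, and this is not existing Accumulo code):

{noformat}
// Illustrative sketch of an ad hoc retry around the cleanup delete in BulkImport.visit().
// The retry count and sleep are placeholders; the loop simply waits out a namenode failover.
private void deleteWithRetry(FileSystem fs, Path path) throws Exception {
  IOException last = null;
  for (int attempt = 0; attempt < 5; attempt++) {
    try {
      fs.delete(path, true);  // recursive delete of the bulk import directory
      return;
    } catch (IOException e) {
      last = e;               // likely the namenode dying or failing over
      Thread.sleep(10 * 1000);
    }
  }
  throw last;
}
{noformat}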


was (Author: bhavanki):
The failure here is ultimately due to running the test under Hadoop 2.0.0. Up until then, client-to-namenode calls like {{delete}}, annotated as {{AtMostOnce}} in {{org.apache.hadoop.hdfs.protocol.ClientProtocol}}, were not retried; only operations marked {{Idempotent}} were. Starting with the implementation of HADOOP-9792 in Hadoop 2.1.0, {{AtMostOnce}}-annotated operations are also retried. So, I expect that upgrading my cluster to Hadoop 2.1.0 or higher would resolve this issue.

The {{mkdirs}} call is annotated as {{Idempotent}} so it should not cause this problem, even under Hadoop 2.0.0.

I'm not sure that adding an _ad hoc_ retry here is the best idea to resolve this, so any opinions are welcome.

> Concurrent randomwalk fails when namenode dies after bulk import step
> ---------------------------------------------------------------------
>
>                 Key: ACCUMULO-2227
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2227
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 1.4.4
>            Reporter: Bill Havanki
>              Labels: ha, randomwalk, test
>
> Running Concurrent randomwalk under HDFS HA, if the active namenode is killed:
> {noformat}
> 20 12:27:51,119 [retry.RetryInvocationHandler] WARN : Exception while invoking class org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.delete. Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked
> java.io.IOException: Failed on local exception: java.io.IOException: Response is null.; Host Details : local host is: "slave.domain.com/10.20.200.113"; destination host is: "namenode.domain.com":8020;
> ...
>  at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:1487)
> at org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:355)
> at org.apache.accumulo.server.test.randomwalk.concurrent.BulkImport.visit(BulkImport.java:140)
> ...
> Caused by: java.io.IOException: Response is null.
> at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:952)
> at org.apache.hadoop.ipc.Client$Connection.run(Client.java:847)
> {noformat}
> This arises from an HDFS path delete call that cleans up after the bulk import. The test should be resilient here (and when the paths are created earlier in the test) so that it can continue once failover has completed.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)