Posted to notifications@accumulo.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2015/05/13 19:05:59 UTC

[jira] [Commented] (ACCUMULO-3811) Improve exception during held commits sent back to clients from BatchWriter

    [ https://issues.apache.org/jira/browse/ACCUMULO-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542248#comment-14542248 ] 

Josh Elser commented on ACCUMULO-3811:
--------------------------------------

I'm seeing this pretty regularly with DN agitation on, with held commits being the cause each time. I'm surprised to be seeing this little resilience out of HDFS:

Big CI picture w/ agitation
{noformat}
20150513 08:58:26 Killing datanode on cn021
20150513 09:08:26 Starting datanode on cn021

2015-05-13 09:08:16,473 [tserver.TabletServer$ThriftClientHandler] ERROR: Commits are held
org.apache.accumulo.tserver.HoldTimeoutException: Commits are held

2015-05-13 09:08:16,479 [impl.TabletServerBatchWriter] ERROR: Server side error on cn022:9997: org.apache.thrift.TApplicationException: Internal error processing closeUpdate
2015-05-13 09:08:16,483 [start.Main] ERROR: Thread 'org.apache.accumulo.test.continuous.ContinuousIngest' died.
{noformat}

Maybe some blocks aren't fully replicated? I'm not sure, but I feel like things shouldn't bog down like this.
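
If it is a replication problem, one way to check during the next agitation window would be something like the following (assuming the Accumulo root dir is /accumulo; adjust the path as needed):

{noformat}
hdfs dfsadmin -report                          # look at the "Under replicated blocks" counter
hdfs fsck /accumulo -files -blocks -locations  # see which files/blocks are affected and where their replicas live
{noformat}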

> Improve exception during held commits sent back to clients from BatchWriter
> ---------------------------------------------------------------------------
>
>                 Key: ACCUMULO-3811
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3811
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: client, tserver
>            Reporter: Josh Elser
>             Fix For: 1.8.0
>
>
> Running CI on 1.7.0_rc3 with datanode agitation, I'm frequently seeing the BatchWriter die.
> It seems that when the ingester tries to flush right after a datanode dies, the tablet server is unable to minor compact, which blocks the flush and ultimately results in a HoldTimeoutException being thrown.
> It might be that, due to under-replication, there are no other datanodes available to serve the necessary block, but it's a good example of how clients have no way to recover from this case. Clients should be able to tell that the system is blocking writes, wait, and then retry their update. Right now they just see an opaque AccumuloSecurityException with no indication of the nature of the failure.
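
For context on the kind of client-side handling the description is asking for, here's a rough sketch (not the fix for this ticket) of what an ingest client has to do today to survive a rejected batch: throw away the failed BatchWriter, create a new one, and blindly resend everything, because the exception doesn't say whether the writes were merely held or permanently failed. Table name, batch contents, retry count, and backoff are all made up for illustration.

{code:java}
import java.util.List;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Mutation;

public class RetryingIngest {

  // Re-send the whole batch through a fresh BatchWriter when a write is
  // rejected, with a simple linear backoff between attempts.
  public static void writeWithRetry(Connector conn, String table, List<Mutation> batch)
      throws TableNotFoundException, InterruptedException {
    final int maxAttempts = 5; // illustrative limit
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      BatchWriter bw = conn.createBatchWriter(table, new BatchWriterConfig());
      try {
        for (Mutation m : batch) {
          bw.addMutation(m);
        }
        bw.close(); // close() flushes; throws MutationsRejectedException on failure
        return;     // batch written successfully
      } catch (MutationsRejectedException e) {
        // A BatchWriter that has seen a failure can't be reused, so drop it and
        // build a new one on the next attempt. Note the client still can't tell
        // a held-commit timeout apart from a permanent failure -- that's this ticket.
        try {
          bw.close();
        } catch (Exception ignored) {
          // best-effort cleanup; the writer may already be dead
        }
        if (attempt == maxAttempts) {
          throw new RuntimeException("Giving up after " + maxAttempts + " attempts", e);
        }
        Thread.sleep(1000L * attempt);
      }
    }
  }
}
{code}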



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)