Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/08/18 22:46:00 UTC

[jira] [Commented] (HDFS-16659) JournalNode should throw CacheMissException if SinceTxId is bigger than HighestWrittenTxId

    [ https://issues.apache.org/jira/browse/HDFS-16659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17581564#comment-17581564 ] 

ASF GitHub Bot commented on HDFS-16659:
---------------------------------------

xkrogen commented on PR #4560:
URL: https://github.com/apache/hadoop/pull/4560#issuecomment-1220049310

   Thanks for trying to tackle this issue! @shvachko and I actually discussed this potential issue long ago but had not observed problems in practice; I suspect it is made much worse by using cross-DC JNs.
   
   I don't feel that a `CacheMissException` is correct. The situation where the NN requests edits newer than what the JNs have is expected to be common, especially if the transaction rate is low, since in this situation the NN will constantly poll the JNs for new edits by sending `sinceTxID = highestWrittenTxId + 1`. I see you're trying to handle this by special-casing when the `sinceTxId` is `getHighestWrittenTxId() + 1`, but it seems pretty hacky/brittle.
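   To make that special-case concrete, here is a minimal sketch (my own illustration with placeholder names, not the PR's actual code) of the check being discussed, where `sinceTxId == highestWrittenTxId + 1` is treated as a routine empty poll:

```java
// Illustration only: the JN-side special-case under discussion.
// Class and method names are placeholders, not actual Hadoop identifiers.
public class PollCheck {
    /**
     * A fully caught-up NN polls with sinceTxId = highestWrittenTxId + 1,
     * so exactly that value is a routine empty poll; anything larger would
     * mean the requester is ahead of everything this JN has written.
     */
    static boolean isRoutineEmptyPoll(long sinceTxId, long highestWrittenTxId) {
        return sinceTxId == highestWrittenTxId + 1;
    }
}
```

   The brittleness is visible here: the check distinguishes "exactly one past the end" from "further past the end", which only works if every caller's polling behavior matches that assumption.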
   
   My initial thought is that we should make a special-case return value when `sinceTxId > highestWrittenTxId` (maybe `-1`) and on the NN side, if you find some responses with `txnCount > 0` and some responses with `txnCount < 0`, then you only use the responses with `txnCount > 0`. The main issue I see with this is that `AsyncLoggerSet#waitForWriteQuorum()` isn't set up to handle this kind of situation; it will just return as soon as there are a quorum of non-error responses.
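   For illustration, a minimal sketch of the NN-side filtering under this idea (the sentinel value and all names are assumptions, not existing Hadoop code):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the sentinel-value idea: a JN with nothing newer than
// sinceTxId returns txnCount = -1, and the NN keeps only responses
// that actually carry edits. Names are placeholders.
public class SentinelFilter {
    static final int NEWER_TXN_SENTINEL = -1;

    static List<Integer> usableResponses(List<Integer> txnCounts) {
        // Drop sentinel (and empty) responses; only positive counts
        // contribute edits the tailer can replay.
        return txnCounts.stream()
            .filter(c -> c > 0)
            .collect(Collectors.toList());
    }
}
```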
   
   As an alternative, we could create a new exception different from `CacheMissException`, like `NewerTxnIdException`, which the JN throws in the situation of `startTxId > highestWrittenTxId`. Since it's an exception, `waitForWriteQuorum()` will try to throw away JNs that threw it. If only some JNs throw the exception, then we still get a valid result from `waitForWriteQuorum()`. If too many JNs throw the exception, then we can catch it on the NN side and swallow the exception to treat it as a normal/expected situation. I think this would avoid us having to special-case `startTxId + 1` on the JN side.
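   A rough sketch of that alternative (the exception name comes from the discussion above, but the method shape is a simplification; the real `getJournaledEdits` returns a protobuf response):

```java
// Sketch of the JN-side behavior with a dedicated exception type.
public class NewerTxnIdSketch {
    static class NewerTxnIdException extends RuntimeException {
        NewerTxnIdException(long sinceTxId, long highestWrittenTxId) {
            super("sinceTxId " + sinceTxId + " is beyond highestWrittenTxId "
                + highestWrittenTxId);
        }
    }

    /** Simplified stand-in: returns the txnCount instead of a proto. */
    static int getJournaledEdits(long sinceTxId, long highestWrittenTxId) {
        if (sinceTxId > highestWrittenTxId) {
            // waitForWriteQuorum() discards responses from JNs that threw;
            // if a quorum throws, the NN can catch this and treat it as a
            // normal "no new edits" outcome.
            throw new NewerTxnIdException(sinceTxId, highestWrittenTxId);
        }
        return (int) (highestWrittenTxId - sinceTxId + 1);
    }
}
```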
   
   WDYT?




> JournalNode should throw CacheMissException if SinceTxId is bigger than HighestWrittenTxId
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16659
>                 URL: https://issues.apache.org/jira/browse/HDFS-16659
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> JournalNode should throw `CacheMissException` if `sinceTxId` is bigger than `highestWrittenTxId` during handling `getJournaledEdits` rpc from NNs. 
> The current logic can, in some corner cases, leave the in-progress EditLogTailer unable to replay any edits from the JournalNodes, so the Observer NameNode cannot handle requests from clients.
> Suppose there are 3 JournalNodes, JN0 ~ JN2.
> * JN0 hits some abnormal condition while the Active NameNode is syncing 10 edits starting at txid 11
> * The NameNode ignores the abnormal JN0 and continues syncing edits to JN1 and JN2
> * JN0 recovers
> * The NameNode syncs another 10 edits starting at txid 21.
> * At this point, JN0's cache contains none of the edits 11 ~ 30
> * The Observer NameNode tries to select an EditLogInputStream via `getJournaledEdits` with sinceTxId 21
> * JN2 hits some abnormal condition and responds slowly
> The expected result: JN1 and JN2 should each return the 10 edits from txid 21 to txid 30, since the Active NameNode successfully wrote these edits to JN1 and JN2 and failed to write them to JN0.
> But in the current implementation, the quorum result is [Response(0) from JN0, Response(10) from JN1], because the abnormal condition on JN2 (GC pause, bad network, etc.) causes its response to arrive too late to count. The `maxAllowedTxns` is therefore 0, and the NameNode replays no edits.
> As described above, the root cause is that the JournalNode should throw a `CacheMissException` when `sinceTxId` is bigger than `highestWrittenTxId`.
> The buggy code is shown below:
> {code:java}
> if (sinceTxId > getHighestWrittenTxId()) {
>     // Requested edits that don't exist yet; short-circuit the cache here
>     metrics.rpcEmptyResponses.incr();
>     return GetJournaledEditsResponseProto.newBuilder().setTxnCount(0).build(); 
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org