You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@couchdb.apache.org by GitBox <gi...@apache.org> on 2021/10/21 00:24:59 UTC

[GitHub] [couchdb] tonysun83 opened a new pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

tonysun83 opened a new pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795


   ## Overview
   
   Sometimes, a clouseau index may become out of sync with the db or
   become corrupt under certain conditions (heap exhaustion, garbage collection).
   This leads to certain HitIds that get returned which may not exist
   in the current db. Before this fix, we assume the Hit's Id will
   automatically match an Id return from:
   
   `
   fabric:all_docs(DbName, fun callback/2, [], [{keys, DocIds}, {include_docs, true}]).
   `
   
   When the document does not exist, {error, false} is returned for the row.
   This fix  accounts for this scenario and possible other scenarios
   by returning an error to the user and also logging the error when the
   Hit Id does not match what is returned from the _all_docs call.
   
   Another option is to implement a read-repair solution but I don't think that it's possible without rebuilding the index.
   Note that this issue surfaced when a user ran similar queries across several databases with similar documents and indexes. One of the queries resulted in a document id being returned in a Hit that the user claims belongs to another database. That caused the `case_clause`. A rebuild of the index resolved the issue temporarily but surfaced again as heap levels reached max capacity again.
   
   
   ## Testing recommendations
   Unit test added for possible error scenarios. Current regression tests should pass.
   
   ## Related Issues or Pull Requests
   
   <!-- If your changes affects multiple components in different
        repositories please put links to those issues or pull requests here.  -->
   
   ## Checklist
   
   - [ ] Code is written and works correctly
   - [ ] Changes are covered by tests
   - [ ] Any new configurable parameters are documented in `rel/overlay/etc/default.ini`
   - [ ] A PR for documentation changes has been made in https://github.com/apache/couchdb-documentation
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 edited a comment on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 edited a comment on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-948783750


   > I'm worried about any code that tries to mask situations that should not occur.
   
   I agree that this is a bandaid approach and it masks the underlying root cause. The right approach would be to recreate the exact scenario and see how the indexes get corrupted and see if a change in clouseau can prevent such a situation. Unfortunately, due to security reasons, I am unable to access the user's cluster and I can't reproduce the issue locally (specifically they claim that their particular document that causes the case clause is from a different db). 
   
   > The index should only ever be either current or stale. I don't recognize "out of sync" or "corrupt", can you describe this more?
   
   The user's cluster usage/workflow patterns leads clouseau to constantly allocate more and more heap memory until it reaches the max. Garbage collection kicks frequently and degrades the cluster. Occasionally clouseau will become non-responsive and a restart is required. Some of the clouseau indexes will contain:
   
   ```
   WARNING: 5 broken segments (containing 15231 documents) detected
   WARNING: would write new segments file, and 15231 documents would be lost, if -fix were specified
   ```
   As you well know, this means that the indexes are corrupt. Running `CheckIndex` -fix doesn't resolve the issue and the only way the issue resolves is if the index is rebuilt. 
   
   > If the jvm died for some reason, the index would simply be stale (behind) relative to the database, and the next query, once the jvm has restarted, should update the index as it was before.
   
   What if the document never existed before? https://github.com/apache/couchdb/blob/3.x/src/dreyfus/src/dreyfus_index_updater.erl#L129
   We only update the index upon checking for the latest changes to the db. If the document never existed in the db, then on the dreyfus side, it would appear as though the database is up to date. Whereas on the clouseau, you have this phantom docid that appeared out of nowhere that was never part of the database. That's what the customer is claiming. I wonder if this is even possible. Could there be a situation where clouseau, under high load, writes a document id to a different segment that maps to a different db? Or would some segment merge operation do this? This is all conjecture on my part since I don't have access to the cluster itself. 
   
   > I'm inferring this PR is not about addressing that known case?
   
   It's a lesser of two evils in the short term. The user's application can't continue because of the case_clause. But I have the same concern as you: their application wouldn't crash right away, but they are dealing with incorrect data. Perhaps we should throw a 500 with a more meaningful message? I'm definitely open to suggestions. 
   
   Note that this only happens when include_docs=true. So for the longest time now, under similar situations, users have been getting wrong results back already but they just never got an exception when include_docs=false. This "bug" has been around forever.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson edited a comment on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson edited a comment on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-973141018


   @tonysun83 thanks for the details, this worries me. I can, sadly, imagine there might be bugs in the pid<>index path logic, either in dreyfus or clouseau or in combination.  Clearly, though, they must manifest rarely. Have you confirmed that the document from "another database" is genuinely only present in that database and is not (now, or in the past) present in the expected database?
   
   The fix in this PR remains the wrong approach, I think we both agree. As for returning a fatal/500 error, it's not entirely clear that's right either. For one thing the 500 would only manifest if a search query matched one of these misplaced documents, and not if they don't. If we have indexed documents into the wrong search index then the index is corrupt and no results can be trusted. So really the index should be deleted at that moment and forced to rebuild or, if we prefer, marked as corrupt and no longer allowed to serve requests until someone (the user? the administrator?) intervenes.
   
   The true fix, I think, is to remove the pid<>path mapping entirely, changing the rpc messages to state the index path each time. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-950734749


   for an index with 
   
   ```WARNING: 5 broken segments (containing 15231 documents) detected
   WARNING: would write new segments file, and 15231 documents would be lost, if -fix were specified
   ```
   we should be deleting the index and letting it rebuild, or, at the very least, logging a 'fatal' error for the administrator and returning a 500 status code indicating the index is broken to the user.
   
   A lucene index managed by clouseau is specific to a single copy of a single shard range of a single database. A lucene index consists of 1 or more "segments", this is a tradeoff between performance (1 segment being ideal) and the ability to update the index efficiently. Segments are merged in the background and Lucene ensures this has no effect on the integrity of the index. That CheckIndex -fix would wipe out all the existing segments and write a new, empty one tells us the index is useless and no results from it should be returned to the user (they would only coincidentally be correct, which is no basis to proceed). We are, of course, on a very old version of Lucene (4.6.1, and latest is 8.10.1) and many fixes and enhancements have happened since.
   
   ```Could there be a situation where clouseau, under high load, writes a document id to a different segment that maps to a different db```
   
   That would be alarming indeed. I don't believe it is possible, as indexes are named after the corresponding source .couch file. It would be a helpful exercise if you could check into this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 edited a comment on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 edited a comment on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-948783750


   > I'm worried about any code that tries to mask situations that should not occur.
   
   I agree that this is a bandaid approach and it masks the underlying root cause. The right approach would be to recreate the exact scenario and see how the indexes get corrupted and see if a change in clouseau can prevent such a situation. Unfortunately, due to security reasons, I am unable to access the user's cluster and I can't reproduce the issue locally (specifically they claim that their particular document that causes the case clause is from a different db). 
   
   > The index should only ever be either current or stale. I don't recognize "out of sync" or "corrupt", can you describe this more?
   
   The user's cluster usage/workflow patterns leads clouseau to constantly allocate more and more heap memory until it reaches the max. Garbage collection kicks frequently and degrades the cluster. Occasionally clouseau will become non-responsive and a restart is required. Some of the clouseau indexes will contain:
   
   ```
   WARNING: 5 broken segments (containing 15231 documents) detected
   WARNING: would write new segments file, and 15231 documents would be lost, if -fix were specified
   ```
   As you well know, this means that the indexes are corrupt. Running `CheckIndex` -fix doesn't resolve the issue and the only way the issue resolves is if the index is rebuilt. 
   
   > If the jvm died for some reason, the index would simply be stale (behind) relative to the database, and the next query, once the jvm has restarted, should update the index as it was before.
   
   What if the document never existed before? https://github.com/apache/couchdb/blob/3.x/src/dreyfus/src/dreyfus_index_updater.erl#L129
   We only update the index upon checking for the latest changes to the db. If the document never existed in the db, then on the dreyfus side, it would appear as though the database is up to date. Whereas on the clouseau, you have this phantom docid that appeared out of nowhere that was never part of the database. That's what the customer is claiming. I wonder if this is even possible. Could there be a situation where clouseau, under high load, writes a document id to a different segment that maps to a different db? Or would some segment merge operation do this? This is all conjecture on my part since I don't have access to the cluster itself. 
   
   > I'm inferring this PR is not about addressing that known case?
   It's a lesser of two evils in the short term. The user's application can't continue because of the case_clause. But I have the same concern as you: their application wouldn't crash right away, but they are dealing with incorrect data. Perhaps we should throw a 500 with a more meaningful message? I'm definitely open to suggestions. 
   
   Note that this only happens when include_docs=true. So for the longest time now, under similar situations, users have been getting wrong results back already but they just never got an exception when include_docs=false. This "bug" has been around forever.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-973084186


   @rnewson , I'm unable to reproduce the issue for this particular user. However, they confirmed that the document id belongs to a different database that resides on the same cluster.
   
   The code shows that the segment files are written in a directory path based simply on DbName + Hash:
   https://github.com/apache/couchdb/blob/3.x/src/dreyfus/src/dreyfus_index.erl#L250. A
   
   On the clouseau side, as you're fully aware, the PATH is mapped to PIDs which get returned to dreyfus:
   
   https://github.com/cloudant-labs/clouseau/blob/master/src/main/scala/com/cloudant/clouseau/IndexManagerService.scala#L44-L45
   
   For the user behavior to manifest, one possibility is that the PID mapping is modified and the PID  writes to a different path. I don't see that occurring anywhere in the  Clouseau codebase. That doesn't exclude the possibility it's happening, but unfortunately, I cannot reproduce it.
   
   It's a corner case and the user is getting a function_clause. I propose that instead of return results, that we do: 
   
   > logging a 'fatal' error for the administrator and returning a 500 status code indicating the index is broken to the user.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-948783750


   > I'm worried about any code that tries to mask situations that should not occur.
   
   I agree that this is a bandaid approach and it masks the underlying root cause. The right approach would be to recreate the exact scenario and see how the indexes get corrupted and see if a change in clouseau can prevent such a situation. Unfortunately, due to security reasons, I am unable to access the user's cluster and I can't reproduce the issue locally (specifically they claim that their particular document that causes the case clause is from a different db). 
   
   > The index should only ever be either current or stale. I don't recognize "out of sync" or "corrupt", can you describe this more?
   
   The user's cluster usage/workflow patterns leads clouseau to constantly allocate more and more heap memory until it reaches the max. Garbage collection kicks frequently and degrades the cluster. Occasionally clouseau will become non-responsive to clouseau and a restart is required. Some of the clouseau indexes will contain:
   
   ```
   WARNING: 5 broken segments (containing 15231 documents) detected
   WARNING: would write new segments file, and 15231 documents would be lost, if -fix were specified
   ```
   As you well know, this means that the indexes are corrupt. Running `CheckIndex` -fix doesn't resolve the issue and the only way the issue resolves is if the index is rebuilt. 
   
   > If the jvm died for some reason, the index would simply be stale (behind) relative to the database, and the next query, once the jvm has restarted, should update the index as it was before.
   
   What if the document never existed before? https://github.com/apache/couchdb/blob/3.x/src/dreyfus/src/dreyfus_index_updater.erl#L129
   We only update the index upon checking for the latest changes to the db. If the document never existed in the db, then on the dreyfus side, it would appear as though the database is up to date. Whereas on the clouseau, you have this phantom docid that appeared out of nowhere that was never part of the database. That's what the customer is claiming. I wonder if this is even possible. Could there be a situation where clouseau, under high load, writes a document id to a different segment that maps to a different db? Or would some segment merge operation do this? This is all conjecture on my part since I don't have access to the cluster itself. 
   
   > I'm inferring this PR is not about addressing that known case?
   It's a lesser of two evils in the short term. The user's application can't continue because of the case_clause. But I have the same concern as you: their application wouldn't crash right away, but they are dealing with incorrect data. Perhaps we should throw a 500 with a more meaningful message? I'm definitely open to suggestions. 
   
   Note that this only happens when include_docs=true. So for the longest time now, under similar situations, users have been getting wrong results back already but they just never got an exception when include_docs=false. This "bug" has been around forever.
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson edited a comment on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson edited a comment on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-948409540


   I'm worried about any code that tries to mask situations that should not occur.
   
   ``` a clouseau index may become out of sync with the db or
   become corrupt under certain conditions (heap exhaustion, garbage collection).
   ```
   
   The index should only ever be either current or stale. I don't recognise "out of sync" or "corrupt", can you describe this more? If the jvm died for some reason, the index would simply be stale (behind) relative to the database, and the next query, once the jvm has restarted, should update the index as it was before.
   
   I would expect the doc to be missing for an include_docs=true if the doc were deleted between the start of the _search request and its completion, but I'm inferring this PR is not about addressing that known case?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-948409540


   I'm worried about any code that tries to mask situations that should not occur.
   
   ``` a clouseau index may become out of sync with the db or
   become corrupt under certain conditions (heap exhaustion, garbage collection).```
   
   The index should only ever be either current or stale. I don't recognise "out of sync" or "corrupt", can you describe this more? If the jvm died for some reason, the index would simply be stale (behind) relative to the database, and the next query, once the jvm has restarted, should update the index as it was before.
   
   I would expect the doc to be missing for an include_docs=true if the doc were deleted between the start of the _search request and its completion, but I'm inferring this PR is not about addressing that known case?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-951103516


   >It would be a helpful exercise if you could check into this.
   
   Will verify with the user.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-950735474


   my summary so far: -1 to this fix as it appears to be an attempt to hide a fatal condition (corrupted index) rather than fix it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] rnewson commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
rnewson commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-973141018


   @tonysun83 thanks for the details, this worries me. I can, sadly, imagine there might be bugs in the pid<>index path logic, either in dreyfus or clouseau or in combination.  Clearly, though, they must manifest rarely. Have you confirmed that the document from "another database" is genuinely only present in that database and is not (now, or in the past) been present in the expected database?
   
   The fix in this PR remains the wrong approach, I think we both agree. As for returning a fatal/500 error, it's not entirely clear that's right either. For one thing the 500 would only manifest if a search query matched one of these misplaced documents, and not if they don't. If we have indexed documents into the wrong search index then the index is corrupt and no results can be trusted. So really the index should be deleted at that moment and forced to rebuild or, if we prefer, marked as corrupt and no longer allowed to serve requests until someone (the user? the administrator?) intervenes.
   
   The true fix, I think, is to remove the pid<>path mapping entirely, changing the rpc messages to state the index path each time. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 commented on pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 commented on pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795#issuecomment-1009289045


   closing this as the right approach requires a fairly large refactoring. Will use a different PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [couchdb] tonysun83 closed pull request #3795: fix case_clause when HitId and DocId do not match when include_docs=true

Posted by GitBox <gi...@apache.org>.
tonysun83 closed pull request #3795:
URL: https://github.com/apache/couchdb/pull/3795


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@couchdb.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org