Posted to dev@lucene.apache.org by "Yonik Seeley (JIRA)" <ji...@apache.org> on 2016/02/11 05:23:18 UTC

[jira] [Comment Edited] (SOLR-8586) Implement hash over all documents to check for shard synchronization

    [ https://issues.apache.org/jira/browse/SOLR-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142215#comment-15142215 ] 

Yonik Seeley edited comment on SOLR-8586 at 2/11/16 4:22 AM:
-------------------------------------------------------------

bq. Yep, I've been looping a custom version of the HDFS-nothing-safe test that among other things, only does adds, no deletes.

Update: when I reverted my custom changes to the chaos test (so that it also did deletes), I got a large number of shard-out-of-sync errors... seemingly even more than before, so I've been trying to track those down.  The issues I saw did not look related to PeerSync... documents were missing from a shard that had replicated from the leader while buffering documents, yet I saw those missing documents come in and get buffered, which points to transaction log buffering or replay issues.

Then I realized that I had tested "adds only" before committing, and ran the normal test after committing and doing a "git pull".  SOLR-8575, a fix to the HDFS tlog, came in between those two runs!  I've been looping the test for a number of hours with those changes reverted, and I haven't seen a shards-out-of-sync failure so far.  I've also done a quick review of SOLR-8575, but didn't see anything obviously incorrect.  The changes in that issue may just be uncovering another bug (due to timing) rather than causing one... too early to tell.

I've also been running the non-HDFS version of the test for over a day, and have seen no inconsistent-shard failures there either.


> Implement hash over all documents to check for shard synchronization
> --------------------------------------------------------------------
>
>                 Key: SOLR-8586
>                 URL: https://issues.apache.org/jira/browse/SOLR-8586
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>             Fix For: 5.5, master
>
>         Attachments: SOLR-8586.patch, SOLR-8586.patch, SOLR-8586.patch, SOLR-8586.patch
>
>
> An order-independent hash across all of the versions in the index should suffice.  The hash itself is pretty easy, but we need to figure out when/where to do this check (for example, I think PeerSync is currently used in multiple contexts and this check would perhaps not be appropriate for all PeerSync calls?)
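
To make the "order-independent hash" idea in the description concrete, here is a minimal sketch.  It is not taken from the attached patches: the class and method names are hypothetical, and it assumes each document's version is available as a long.  Each version is scrambled with the MurmurHash3 64-bit finalizer and the results are summed; because addition is commutative, the order in which versions are visited does not matter.

```java
// Hypothetical illustration only; names are not from the SOLR-8586 patches.
final class VersionFingerprint {

    // MurmurHash3 64-bit finalizer: a bijective mix of the input bits.
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    // Order-independent hash: sum of the mixed versions.
    // Overflow simply wraps, which is fine for a fingerprint.
    static long hash(long[] versions) {
        long sum = 0;
        for (long v : versions) {
            sum += fmix64(v);
        }
        return sum;
    }
}
```

Mixing before summing matters: a plain sum of raw versions would give {1, 4} and {2, 3} the same value, whereas the mixed sum makes such collisions unlikely.  Two replicas that hold the same set of versions produce the same hash regardless of index order, so comparing the two values (perhaps along with a document count) would be a cheap consistency check.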


