Posted to dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2014/05/21 10:15:40 UTC

[jira] [Commented] (SOLR-5309) Investigate ShardSplitTest failures

    [ https://issues.apache.org/jira/browse/SOLR-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14004472#comment-14004472 ] 

Shalin Shekhar Mangar commented on SOLR-5309:
---------------------------------------------

I am looking at these failures again today. Yeah, it's been that busy around here :(

I implemented a RateLimitedDirectoryFactory for Solr with a very small limit and forced ShardSplitTest to use it always. This helped reproduce the issue for me. I have finally managed to track down the root cause. It always perplexed me that the difference between expected and actual doc counts was almost always 1.

Whenever we add/delete documents during shard splitting, we synchronously forward the request to the appropriate sub-shard. For add requests, a single sub-shard is selected, but for delete-by-id requests we weren't selecting a single sub-shard; instead, we were forwarding the delete to all sub-shards. This normally works out fine and doesn't cause any damage in practice because the id exists on only one sub-shard. However, when one sub-shard (the right one) accepts the delete and the other rejects it (maybe because it became active in the meantime), the client (ShardSplitTest) gets an error back and assumes that the delete did not succeed, whereas it actually succeeded on the right sub-shard.
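
To make the difference concrete, here is a minimal sketch that contrasts forwarding a delete-by-id to every sub-shard with selecting the single sub-shard whose hash range covers the id. It is not Solr's code: the SubShard class, the method names, the shard names and the use of String.hashCode() as the hash are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SubShardRouting {

    /** Hypothetical sub-shard that owns an inclusive range of hash values. */
    public static final class SubShard {
        final String name;
        final int minHash;
        final int maxHash;

        public SubShard(String name, int minHash, int maxHash) {
            this.name = name;
            this.minHash = minHash;
            this.maxHash = maxHash;
        }

        boolean covers(int hash) {
            return hash >= minHash && hash <= maxHash;
        }
    }

    /** Stand-in hash for this sketch only (an assumption, not Solr's router hash). */
    public static int hash(String id) {
        return id.hashCode();
    }

    /** Behaviour described above: the delete is forwarded to every sub-shard. */
    public static List<SubShard> broadcastTargets(List<SubShard> subShards, String id) {
        return new ArrayList<>(subShards);
    }

    /** Intended behaviour: only the sub-shard whose range covers the id's hash. */
    public static List<SubShard> routedTargets(List<SubShard> subShards, String id) {
        int h = hash(id);
        List<SubShard> targets = new ArrayList<>();
        for (SubShard s : subShards) {
            if (s.covers(h)) {
                targets.add(s);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // Two illustrative sub-shards that together cover the whole int range.
        List<SubShard> subShards = Arrays.asList(
            new SubShard("shard1_0", Integer.MIN_VALUE, -1),
            new SubShard("shard1_1", 0, Integer.MAX_VALUE));
        System.out.println("broadcast targets: " + broadcastTargets(subShards, "doc-42").size());
        System.out.println("routed targets:    " + routedTargets(subShards, "doc-42").size());
    }
}
```

With a single target there is only one response to interpret, so a rejection from a sub-shard that never held the document can no longer turn a successful delete into a reported failure.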

We always advise our users to retry update operations upon failure, and they would be fine if they followed this advice during shard splitting as well. Unfortunately, ShardSplitTest doesn't follow that advice: it just counts successes and failures and ends up with an inconsistent state.
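
For reference, the retry advice could look roughly like the sketch below on the client side. It is only an illustration: RetryingUpdate, runWithRetry and the linear backoff are hypothetical, and the Callable stands in for whatever client call actually sends the update.

```java
import java.util.concurrent.Callable;

public class RetryingUpdate {

    /**
     * Runs the update until it succeeds or maxAttempts is reached.
     * Returns true only if some attempt completed without throwing.
     */
    public static boolean runWithRetry(Callable<Void> update, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                update.call();
                return true;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts, report failure
                }
                try {
                    // Simple linear backoff before the next attempt.
                    Thread.sleep(backoffMillis * attempt);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated update that is rejected twice before being accepted.
        boolean ok = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("update rejected");
            }
            return null;
        }, 5, 10L);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " calls");
    }
}
```

Retrying a delete by id is safe because repeating it leaves the index in the same state, so a client that counts an operation as successful only after a confirmed attempt keeps its bookkeeping consistent.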

I'll start by fixing delete-by-id to route requests to the correct (single) sub-shard and enabling this test again.

> Investigate ShardSplitTest failures
> -----------------------------------
>
>                 Key: SOLR-5309
>                 URL: https://issues.apache.org/jira/browse/SOLR-5309
>             Project: Solr
>          Issue Type: Task
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>            Priority: Blocker
>
> Investigate why ShardSplitTest is failing sporadically.
> Some recent failures:
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Windows/3328/
> http://jenkins.thetaphi.de/job/Lucene-Solr-trunk-Linux/7760/
> http://jenkins.thetaphi.de/job/Lucene-Solr-4.x-MacOSX/861/



--
This message was sent by Atlassian JIRA
(v6.2#6252)
