Posted to solr-user@lucene.apache.org by Matt Pearce <ma...@flax.co.uk> on 2018/09/21 14:17:10 UTC

Corrupted index in SolrCloud

Hi,

We've just been working with a client who had a corruption issue with 
their SolrCloud install. They're running Solr 5.3.1, with a collection 
spread across 12 shards. Each shard has a single replica.

They were seeing "Index Corruption" errors when running certain queries. 
We investigated, and narrowed it down to a single shard. Using the 
Lucene CheckIndex utility, we tested both the primary and replica copies 
of the data, and found the same issue with both - the first segment, 
containing the majority of the documents, was reporting corruption. They 
were able to restore from a backup, but it would be good to get some 
idea what could have caused the problem in SolrCloud. One of the 
machines ran out of disk space last week during indexing, which we guess 
could have been the starting point for the corrupted data files.
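For anyone wanting to repeat the check on both copies of a shard, here is a minimal sketch that wraps the CheckIndex command-line entry point (org.apache.lucene.index.CheckIndex, shipped in lucene-core). The jar location and index directories are assumptions that will differ per install, and the helper names are made up for illustration. It runs a read-only check only, ideally against a copy of the index or with the core stopped.

```python
import subprocess

# Main class of the Lucene index checker (part of lucene-core).
CHECKINDEX_CLASS = "org.apache.lucene.index.CheckIndex"


def checkindex_command(lucene_core_jar, index_dir):
    """Build the argv list for a read-only CheckIndex run.

    Both arguments are assumptions about the local install: the path to
    the lucene-core jar matching the Solr version, and the shard's index
    directory on disk.
    """
    return ["java", "-cp", lucene_core_jar, CHECKINDEX_CLASS, index_dir]


def run_checkindex(lucene_core_jar, index_dir):
    """Run CheckIndex and return (exit_code, report_text)."""
    result = subprocess.run(
        checkindex_command(lucene_core_jar, index_dir),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

Calling run_checkindex once for the primary's index directory and once for the replica's gives two reports that can be compared segment by segment.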

Our question is: why would the corruption have spread to the replica as 
well? Could a corrupted document be replicated and cause the replica 
index to break as well?

Thanks,

Matt

-- 
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk

Re: Corrupted index in SolrCloud

Posted by Matt Pearce <ma...@flax.co.uk>.
Thanks for the explanation Erick, that makes sense!

Matt

On 21/09/2018 15:50, Erick Erickson wrote:
> The disk corruption is, of course, a red flag and likely the root cause.
> 
> As for how it replicated, let's assume a 2-replica shard (leader +
> follower). If the follower ever went into full recovery, it would use
> old-style replication to copy down the entire index, corruption
> and all, from the leader. The follower can go into "full recovery" for
> a number of reasons, from being shut down for a while as indexing
> continued on the leader to communications burps.
> 
> There's been a lot of work put in to making fewer full recoveries, but
> much of that only came to fruition in recent Solr releases, especially
> starting with Solr 7.3. (SOLR-11702)
> 
> Best,
> Erick

Re: Corrupted index in SolrCloud

Posted by Erick Erickson <er...@gmail.com>.
The disk corruption is, of course, a red flag and likely the root cause.

As for how it replicated, let's assume a 2-replica shard (leader +
follower). If the follower ever went into full recovery, it would use
old-style replication to copy down the entire index, corruption
and all, from the leader. The follower can go into "full recovery" for
a number of reasons, from being shut down for a while as indexing
continued on the leader to communications burps.

There's been a lot of work put into reducing full recoveries, but
much of that only came to fruition in recent Solr releases, especially
starting with Solr 7.3. (SOLR-11702)
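To make the propagation concrete, here is a toy model in Python. It is not Solr code: the Replica class, the segment file names and the byte contents are all invented for illustration, and it does not try to represent any checks Solr performs during the transfer. It only shows that a file-level copy of the leader's index carries a damaged segment along with it.

```python
import hashlib


def checksum(data):
    """Content hash used here to tell a healthy segment from a damaged one."""
    return hashlib.sha256(data).hexdigest()


class Replica:
    """Toy stand-in for one copy of a shard: segment file name -> raw bytes."""

    def __init__(self, name, segments):
        self.name = name
        self.segments = dict(segments)

    def full_recovery_from(self, leader):
        # Full recovery in this model: drop the local files and copy the
        # leader's files byte for byte, whatever state they are in.
        self.segments = dict(leader.segments)


# Hypothetical healthy index: one large first segment, one small one.
healthy = {"_0.cfs": b"docs 1-1000", "_1.cfs": b"docs 1001-1100"}
expected = {name: checksum(data) for name, data in healthy.items()}


def corrupt_segments(replica):
    """Names of segments whose bytes no longer match the healthy checksum."""
    return sorted(
        name for name, data in replica.segments.items()
        if checksum(data) != expected[name]
    )


leader = Replica("leader", healthy)
follower = Replica("follower", healthy)

# Simulate damage on the leader only, e.g. a truncated segment file.
leader.segments["_0.cfs"] = b"docs 1-10"

before = corrupt_segments(follower)   # follower is still clean here
follower.full_recovery_from(leader)   # follower copies the leader wholesale
after = corrupt_segments(follower)    # follower now has the damaged segment
```

In this model the follower is clean until the moment it does a full recovery, which matches the case where both copies end up reporting the same bad first segment.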

Best,
Erick