Posted to solr-user@lucene.apache.org by Nishanth S <ni...@gmail.com> on 2015/01/21 18:13:20 UTC

Solr Recovery process

Hello Everyone,

I am hitting a few issues with Solr replicas going into recovery and then
doing a full index copy. I am trying to understand the Solr recovery
process. I have read a few blogs on this and saw that when the leader
notifies a replica to recover (in my case due to connection resets), it
will try a peer sync first, and if more than 100 updates were missed it
will do a full index copy from the leader. I am trying to understand what
peer sync is and where the tlog comes into the picture. Are tlogs replayed
only during server restart? Can someone help me with this?

Thanks,
Nishanth

Re: Solr Recovery process

Posted by Erick Erickson <er...@gmail.com>.

Shalin:

Just to see if my understanding is correct, how often would you expect <2> to
occur? My assumption so far is that it would be quite rare that the leader
and all replicas happened to hit autocommit points at the same time and thus it
would be safe to just bring down a few segments. But that's
an assumption, I have no facts to back that up.

Nishanth:

Currently no, you can't configure the number of missed updates that
still allows peer sync. Getting to the bottom of the connection resets
seems indicated.

Best
Erick

Re: Solr Recovery process

Posted by Nishanth S <ni...@gmail.com>.

Thank you Ram.

Re: Solr Recovery process

Posted by "Ramkumar R. Aiyengar" <an...@gmail.com>.

https://issues.apache.org/jira/browse/SOLR-6359 has a patch which allows
this to be configured; it has not gone in yet.

Note that the current design of the UpdateLog causes it to be less
efficient if the number is bumped up too much, but certainly worth
experimenting with.

Re: Solr Recovery process

Posted by Nishanth S <ni...@gmail.com>.

Thank you Shalin. So in a system where the indexing rate is more than 5K TPS
or so, the replica will never be able to recover through the peer sync
process. In my case I have mostly seen step 3, where a full copy happens,
and if the index is huge it takes a very long time for replicas to
recover. Is there a way we can configure the number of missed updates for
peer sync?

Thanks,
Nishanth
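
To put the 5K TPS figure in perspective, here is a small Python sketch of how
long an outage the 100-update window can absorb. It assumes a steady indexing
rate and that the replica misses every update sent during the outage; the
function name is hypothetical and not part of Solr.

```python
def peer_sync_budget_ms(updates_per_second, window=100):
    """Outage length, in milliseconds, after which `window` updates have
    been missed, assuming a steady indexing rate (an assumption)."""
    if updates_per_second <= 0:
        raise ValueError("updates_per_second must be positive")
    return window * 1000 / updates_per_second


if __name__ == "__main__":
    # Compare a few illustrative indexing rates.
    for rate in (50, 500, 5000):
        print(f"{rate} updates/s -> about {peer_sync_budget_ms(rate):.0f} ms")
```

At 5,000 updates per second the window is used up in about 20 ms, so almost
any connection reset would push the replica past peer sync.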

Re: Solr Recovery process

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Nishanth,

The recovery happens as follows:

1. PeerSync is attempted first. If the number of new updates on the leader is
less than 100 then the missing documents are fetched directly and indexed
locally. The tlog tells us the last 100 updates very quickly. Other uses of
the tlog are durability of updates and, of course, startup recovery.
2. If the above step fails then replication recovery is attempted. A hard
commit is called on the leader and then the leader is polled for the latest
index version and generation. If the leader's version and generation are
greater than the local index's version/generation then the difference in
index files between leader and replica is fetched and installed.
3. If the above fails (because the leader's version/generation is somehow
equal to or less than the local one) then a full index recovery happens and
the entire index from the leader is fetched and installed locally.
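
The three steps above can be read as a simple decision cascade. The Python
sketch below is only a model of that description, not Solr's actual code: the
names IndexState and choose_recovery are hypothetical, and the threshold of
100 is the figure quoted in this thread.

```python
from dataclasses import dataclass

PEER_SYNC_WINDOW = 100  # update count quoted in this thread


@dataclass(frozen=True)
class IndexState:
    """Hypothetical stand-in for an index's version and generation."""
    version: int
    generation: int


def choose_recovery(missed_updates, leader, local, window=PEER_SYNC_WINDOW):
    """Return which recovery path the steps above would take."""
    # Step 1: few enough missed updates, so fetch them from a peer.
    if missed_updates < window:
        return "peersync"
    # Step 2: leader is strictly ahead, so fetch only the differing files.
    if leader.version > local.version and leader.generation > local.generation:
        return "replicate-diff"
    # Step 3: anything else falls back to copying the whole index.
    return "full-copy"


if __name__ == "__main__":
    ahead = IndexState(version=20, generation=7)
    behind = IndexState(version=10, generation=5)
    print(choose_recovery(40, ahead, behind))
    print(choose_recovery(5000, ahead, behind))
    print(choose_recovery(5000, behind, behind))
```

Keeping the threshold as a parameter makes it easy to see how a larger window
would shift more recoveries into the first branch.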

There are some other details involved in this process too but probably not
worth going into here.

-- 
Regards,
Shalin Shekhar Mangar.