Posted to solr-user@lucene.apache.org by S G <sg...@gmail.com> on 2018/01/03 03:02:26 UTC

New replica types

Hi,

I was excited to see some good work in having more replica types for Solr.

However, the Solr documentation left me with a few questions.
https://lucene.apache.org/solr/guide/7_2/shards-and-indexing-data-in-solrcloud.html#types-of-replicas


This is what I could come up with:
(Note that each numbered point corresponds to the same-numbered point
under the other replica types, so they are easy to compare)


NRT
  1) Indexes locally
  2) Remains in-sync with leader, hence leader eligible
  3) Queries return latest data
  4) Replicates tlog or full-index depending on how far behind it is from
the leader
  5) Recommended when no query should ever return stale data
  6) Penalizes the leader for full-index copy only when the replica is
missing a lot of updates (configurable though).


TLOG
  1) Does not index locally
  2) Remains in-sync with leader, hence leader eligible
  3) Queries generally return stale data as it does not index locally
  4) Replicates tlog to remain in sync but also does periodic full-index
copy from leader to get indexed data
  5) Recommended for very high throughputs at the cost of getting stale
results.
  6) Can penalize the leader for large full-index copies


PULL
  0) Same as TLOG Replica but copies only the indexed data periodically
  1) Does not index locally
  2) Does not remain in-sync with leader, hence not eligible for leader
election
  3) Queries generally return stale data as it does not index locally
  4) Only does a periodic full-index copy from leader to get indexed data
  5) Recommended for use with NRT or TLOG replicas only to increase read
throughput at the cost of getting stale results.
  6) Can penalize the leader for large full-index copies


If the above is incorrect, can someone please point that out?

Thanks
SG

Re: New replica types

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
Comments inline:

On Wed, Jan 3, 2018 at 11:39 AM, S G <sg...@gmail.com> wrote:
> AFAIK, tlog file is truncated with a hard-commit.
> So if the TLOG replica is only pulling the tlog-file, it would become out
> of date if it does not pull the full index too.
> That means that the TLOG replica would do a full copy every time there is a
> commit on the leader.

TLOG replica does not pull the tlog file. Instead, each update is
pushed from the leader to all TLOG replicas as a synchronous
operation. The same thing happens for NRT replicas as well. Such
updates are appended to the transaction logs on the TLOG replicas.
Similar to the old master/slave model, the leader (hard) commits its
index frequently and the TLOG replica polls the leader to check if a
newer index version is available for download. If there is a new
version on the leader then the replica downloads only the new segments
from the leader that are not present locally.

>
> PULL replica, by definition copies index files only and so it would do full
> recoveries often too.

PULL replicas also download only the newest segments present on the
leader which haven't been copied over previously. At no time does a
full recovery happen unless the replica is new, i.e. it has an empty
index, or it has been out of sync for so long that the leader has
completely new segments due to merges.
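
The incremental download described above can be pictured as a set
difference over index file names. What follows is only a rough sketch
under stated assumptions: plan_fetch and the plain sets of file names
are hypothetical, and the real replication code also compares index
versions and file metadata, which is not modelled here.

```python
def plan_fetch(leader_files, local_files):
    """Decide what a TLOG or PULL replica would need to download.

    Hypothetical helper for illustration only: the file names stand in
    for segment files, and the rules follow the description in this thread.
    """
    leader_files = set(leader_files)
    local_files = set(local_files)

    missing = leader_files - local_files
    if not missing:
        # Nothing new on the leader since the last poll.
        return "up-to-date", []
    if not (leader_files & local_files):
        # Empty local index, or the leader's segments are completely new
        # (for example after merges): everything has to be copied.
        return "full", sorted(leader_files)
    # Usual case: copy only the segments that are not present locally.
    return "incremental", sorted(missing)
```

For example, a replica holding only "_1.cfs" while the leader has
"_1.cfs" and "_2.cfs" would fetch just "_2.cfs" in this model.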

One thing to note is that you should never mix NRT replicas with other
types. Either have a collection with only NRT replicas or have a mix
of TLOG and PULL replicas. This way you ensure that the leader is
never different enough for a full recovery to be required.
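
As an illustration of the TLOG plus PULL layout recommended above, here
is a small sketch that builds a Collections API CREATE request using
the tlogReplicas and pullReplicas parameters. The helper name, the
host, the collection name and the replica counts are made-up
assumptions; the sketch only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards, tlog_replicas, pull_replicas):
    """Build a Collections API CREATE URL for a TLOG + PULL collection.

    Hypothetical helper: no NRT replicas are requested, so the leader
    is always chosen from the TLOG replicas.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "tlogReplicas": tlog_replicas,
        "pullReplicas": pull_replicas,
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Example values only; adjust to your own cluster.
url = create_collection_url("http://localhost:8983/solr", "products", 2, 2, 1)
print(url)
```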

-- 
Regards,
Shalin Shekhar Mangar.

Re: New replica types

Posted by S G <sg...@gmail.com>.
AFAIK, the tlog file is truncated on a hard commit.
So if the TLOG replica is only pulling the tlog-file, it would become out
of date if it does not pull the full index too.
That means that the TLOG replica would do a full copy every time there is a
commit on the leader.

PULL replica, by definition copies index files only and so it would do full
recoveries often too.


How intelligent are the two replica types in determining that they need to
do a full recovery vs partial recovery?
Does full recovery happen every hard-commit on the leader?
Or does it happen with segment merges on the leader? (because index files
will look much different after a segment-merge)



> NRT replicas will typically have very different files in their on-disk
> indexes even though they contain the same documents.


This is something which has caused full recoveries many times in my
clusters and I wish there was a solution to this one.
Do you think it would make sense for all replicas of a shard to agree upon
the segment where a document should go to?
Coupling this with an agreed cadence on segment merges, Solr would never do
a full recovery. (It's a very high-level view of course and will need a lot
of refinement if implemented.)

Getting a cadence on segment merges could possibly be implemented by a
time-based merging strategy where documents arriving within a particular
time-range only will form a particular segment.
So documents arriving between 1pm-2pm go to segment 1, those between
2pm-3pm go to segment 2 and so on.
That way, replicas will only copy the last N segments (with N being 1
generally) instead of doing a full recovery.
Even if merging happens on the leader, the last N segments should not be
cleared, to avoid full recoveries on the replicas.
(I know something like this happens today, but I am not very sure about the
internal details and it is not clearly documented anywhere.)

Currently, I see my replicas go into full recovery even when I dynamically
add a field to a collection or when a replica misses updates for only a few
seconds. (I do have high values for catchup rather than the default 100.)


Thanks
SG







Re: New replica types

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/2/2018 8:02 PM, S G wrote:
> If the above is incorrect, can someone please point that out?

Assuming I have a correct understanding of how the different replica 
types work, I have some small clarifications.  If my understanding is 
incorrect, I hope somebody will point out my errors.

TLOG is leader eligible because it keeps transaction logs from ongoing 
indexing, although it does not perform that indexing on its own index 
unless it becomes the leader.  Transaction logs are necessary for 
operation as leader.

PULL does not keep transaction logs, which is why it is not leader 
eligible.  It only copies the index data.
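
The eligibility rule described above can be summarised in a few lines
of code. This is only a sketch of the reasoning in this message, not
Solr's implementation; ReplicaType, the trait table and
is_leader_eligible are hypothetical names.

```python
from enum import Enum


class ReplicaType(Enum):
    NRT = "nrt"
    TLOG = "tlog"
    PULL = "pull"


# Hypothetical trait table:
# (keeps a transaction log, indexes locally while it is a follower)
_TRAITS = {
    ReplicaType.NRT: (True, True),
    ReplicaType.TLOG: (True, False),
    ReplicaType.PULL: (False, False),
}


def is_leader_eligible(replica_type):
    """A replica can become leader only if it keeps a transaction log."""
    keeps_tlog, _indexes_locally = _TRAITS[replica_type]
    return keeps_tlog


for rt in ReplicaType:
    print(rt.name, "leader eligible:", is_leader_eligible(rt))
```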

Either TLOG or PULL would do a full index copy if the local index is 
suddenly very different from the leader.  This could happen in 
situations where you have NRT replicas and the leader changes -- NRT 
replicas will typically have very different files in their on-disk 
indexes even though they contain the same documents.  When the leader 
changes to a different NRT replica, TLOG/PULL replicas will suddenly 
find that they have a very different list of index files, so they will 
fetch the entire index.
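
A toy model may help show why a leader change between NRT replicas can
force a full copy. It assumes, purely for illustration, that each NRT
replica groups the same documents into segments with a different batch
size; toy_segments and the batch sizes are made up, and real segment
boundaries depend on commits and merges on each node.

```python
def toy_segments(doc_ids, docs_per_flush):
    """Split doc_ids into fixed-size groups; each tuple is one toy segment."""
    return {
        tuple(doc_ids[i:i + docs_per_flush])
        for i in range(0, len(doc_ids), docs_per_flush)
    }


docs = list(range(1, 13))

# Two NRT replicas index the same 12 documents independently.
old_leader = toy_segments(docs, 4)
new_leader = toy_segments(docs, 6)

# Both hold exactly the same documents...
same_docs = (
    {d for seg in old_leader for d in seg}
    == {d for seg in new_leader for d in seg}
)

# ...but share no segment, so a TLOG or PULL replica that was following
# the old leader finds nothing it can reuse after the leader changes.
shared = old_leader & new_leader

print("same documents:", same_docs)
print("shared segments:", len(shared))
```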

Thanks,
Shawn