
Posted to users@solr.apache.org by lstusr 5u93n4 <ls...@gmail.com> on 2021/09/03 15:19:41 UTC

Index dependent groups of data

Hi All,

We have a scenario where we need to:
 - index a group of data (group 1)
 - index a second group of data, sometimes querying for records that were
added in group 1.

We have 6 shards, each composed of two TLOG replicas.

What we're seeing is the following:
 - index some data
 - issue a hard commit
 - issue a query for that data
 - sometimes the query gets routed to a replica that is not yet updated,
and doesn't contain the data.

Some thoughts for how to solve:
 1. issue a hard commit that forces all TLOG replicas to update, not just
the leaders. (Is this possible?)
 2. force the query to go only to the leaders.

Are there any good options for this? I'm sure it's a common pattern to
index data and make it searchable to other data that's still being
indexed... Curious as to how others have solved this same problem.

Thanks!

Kyle

Re: Index dependent groups of data

Posted by Walter Underwood <wu...@wunderwood.org>.

> On Sep 7, 2021, at 9:01 AM, lstusr 5u93n4 <ls...@gmail.com> wrote:
> 
> Well that's kind of the crux of the issue. We're issuing a hard commit
> which (from what I've read) appears to be a synchronous operation. So, when
> the call comes back with a 200 http response code, we can be assured that
> the operation has gone through

This is your mistake. Solr is not transactional. You are assuming ACID properties,
but Solr does not guarantee those, especially cluster-wide.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

Re: Index dependent groups of data

Posted by Cassandra Targett <ca...@gmail.com>.

As Shawn explained, when a TLOG replica is not the leader, it does not index the documents directly but pulls index segments from the leader. However, this operation is generally rather fast - within a second or two - since it copies the changed segments, not the full index (and 70 million docs isn’t usually all that big anyway). I didn’t see where you said how soon you attempt to query the replicas; that would be helpful to know in order to understand whether you really do have a problem.

It’s important to note that a leader does not notify its replicas that it has new segments. This is controlled by the commit configuration - non-leader replicas poll their leader for possible changes at half the interval set for autoCommits, and if that is not set, then half the autoSoftCommit time. Whether you commit or not on the leader has nothing to do with what the replica does - it’s what has been configured for those params.
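To make the polling rule above concrete, here is a minimal Python sketch. The function name `replica_poll_interval_ms` is made up for illustration, and the 3000 ms fallback used when neither setting is configured is an assumption that should be checked against the reference guide for your Solr version. Missing or non-positive values are treated as "not set".

```python
from typing import Optional

# Assumed fallback when neither autoCommit nor autoSoftCommit is configured;
# verify this value against the documentation for your Solr version.
ASSUMED_DEFAULT_POLL_MS = 3000


def replica_poll_interval_ms(auto_commit_max_time_ms: Optional[int],
                             auto_soft_commit_max_time_ms: Optional[int]) -> int:
    """Estimate how often a non-leader TLOG/PULL replica polls its leader."""

    def is_set(value: Optional[int]) -> bool:
        return value is not None and value > 0

    # Half the hard autoCommit interval takes precedence.
    if is_set(auto_commit_max_time_ms):
        return auto_commit_max_time_ms // 2
    # Otherwise fall back to half the autoSoftCommit interval.
    if is_set(auto_soft_commit_max_time_ms):
        return auto_soft_commit_max_time_ms // 2
    return ASSUMED_DEFAULT_POLL_MS
```

With an autoCommit of 15 seconds, for example, this gives a poll roughly every 7.5 seconds, which would explain a query issued right after a commit missing the new documents on a non-leader replica.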

Other factors may make the replication take longer, particularly a slow network or merging segments down to a single very large segment (in which case each replica has to pull the entire index every time you make an update).

You can see when replication happens by looking at the logs for one of the replicas - if you have the default logging levels enabled, you’ll clearly see messages about polling for new segments, and when replication starts and finishes. If you make an index update and don’t see a replica poll within a reasonable amount of time, you may want to change those commit settings I mentioned. If you see it start but take a long time, it’s more likely the network or you’ve merged down to a big segment that takes a while over the wire.

It seems inefficient to me to check who is a leader and only query the leader - that feels very much a workaround for an otherwise misconfigured cluster. The main reason for having replicas is to distribute the query load, and if you direct all your queries to leaders the replicas are basically doing nothing but waiting in case the leader goes down (which, however, is fine if all you care about is disaster recovery, but there are probably lighter weight approaches if that’s your only need).
On Sep 8, 2021, 10:02 AM -0500, lstusr 5u93n4 <ls...@gmail.com>, wrote:
> > Info you might already know: TLOG (and PULL) replicas do not index,
> > unless a TLOG replica is the leader, in which case it behaves exactly
> > like NRT. A PULL replica can never become leader.
> >
> > When you have TLOG or PULL replicas, Solr is only going to do indexing
> > on the shard leaders. When a commit finishes, it should be done on all
> > cores that participate in indexing.
> >
> > Replication of the completed index segments to TLOG and PULL replicas
> > will happen AFTER the commit is done, not concurrently. I don't think
> > there's a reliable way of asking Solr to tell you when all replications
> > are complete.
>
> Thanks Shawn, it's good to have this all spelled out. Validates what we're
> seeing.
>
> > Does your "query only the leaders" code check clusterstate in ZK to
> > figure out which replicas are leader? Leaders can change in response to
> > problems.
>
> Yeah, exactly. Working implementation is to check
> `/collections/<name>/state.json` in ZK to determine the leaders, and put a
> watch on that node to react if the cluster state changes.
>
> I see what you're saying about determining if the replications are
> complete. However, querying the leaders post-commit is good enough for our
> particular use case, so we'll opt to keep the indexing speed as high as
> possible and not wait on the replication before proceeding to the next
> group of data.
>
> Thanks for all your help!
>
> Kyle

Re: Index dependent groups of data

Posted by lstusr 5u93n4 <ls...@gmail.com>.

> Info you might already know:  TLOG (and PULL) replicas do not index,
> unless a TLOG replica is the leader, in which case it behaves exactly
> like NRT.  A PULL replica can never become leader.
>
> When you have TLOG or PULL replicas, Solr is only going to do indexing
> on the shard leaders.  When a commit finishes, it should be done on all
> cores that participate in indexing.
>
> Replication of the completed index segments to TLOG and PULL replicas
> will happen AFTER the commit is done, not concurrently.  I don't think
> there's a reliable way of asking Solr to tell you when all replications
> are complete.

Thanks Shawn, it's good to have this all spelled out. Validates what we're
seeing.

> Does your "query only the leaders" code check clusterstate in ZK to
> figure out which replicas are leader?  Leaders can change in response to
> problems.

Yeah, exactly. Working implementation is to check
`/collections/<name>/state.json` in ZK to determine the leaders, and put a
watch on that node to react if the cluster state changes.
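For readers who want to try the same approach, the leader lookup can be sketched in Python roughly as follows. This is not Kyle's actual implementation. It assumes the JSON has already been read from ZooKeeper as a string, and that each replica entry carries `core`, `base_url`, `state` and a `leader` flag stored as the string "true"; that layout matches what I have seen in Solr 8.x but may differ in other versions. The function name `find_shard_leaders` is hypothetical.

```python
import json
from typing import Dict


def find_shard_leaders(state_json: str, collection: str) -> Dict[str, str]:
    """Map each shard name to the core URL of its current active leader."""
    state = json.loads(state_json)
    leaders: Dict[str, str] = {}
    for shard_name, shard in state[collection]["shards"].items():
        for replica in shard["replicas"].values():
            # Assumption: the leader flag is the string "true", and
            # base_url is present in the stored state.
            if replica.get("leader") == "true" and replica.get("state") == "active":
                base = replica["base_url"].rstrip("/")
                leaders[shard_name] = base + "/" + replica["core"]
    return leaders
```

A shard with no active leader is simply absent from the result, so a caller can compare the number of entries with the expected shard count before relying on it. The ZooKeeper watch itself is left out here, since it depends on the client library in use.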

I see what you're saying about determining if the replications are
complete. However, querying the leaders post-commit is good enough for our
particular use case, so we'll opt to keep the indexing speed as high as
possible and not wait on the replication before proceeding to the next
group of data.

Thanks for all your help!

Kyle

On Tue, 7 Sept 2021 at 17:13, Shawn Heisey <ap...@elyograg.org> wrote:

> On 9/7/2021 3:08 PM, Shawn Heisey wrote:
> > I don't think there's a reliable way of asking Solr to tell you when
> > all replications are complete.
>
>
> You could use the replication handler (/solr/corename/replication) to
> gather this info and compare info from the leader index with info from
> the follower index(es).  For this to be reliable, you would need to
> check clusterstate in ZK so you're absolutely sure which cores are
> leaders.  I do not know off the top of my head what parameters need to
> be sent to the replication handler to gather that info.
>
> Thanks,
> Shawn
>
>

Re: Index dependent groups of data

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/7/2021 3:08 PM, Shawn Heisey wrote:
> I don't think there's a reliable way of asking Solr to tell you when 
> all replications are complete. 

You could use the replication handler (/solr/corename/replication) to 
gather this info and compare info from the leader index with info from 
the follower index(es).  For this to be reliable, you would need to 
check clusterstate in ZK so you're absolutely sure which cores are 
leaders.  I do not know off the top of my head what parameters need to 
be sent to the replication handler to gather that info.
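As a starting point for that comparison, here is a hedged Python sketch. It assumes that `command=details` on the replication handler returns a `details` object containing a `generation` field; I believe that is the case, but it should be confirmed against the Solr version in use. The helper names and the injectable `fetch` argument are illustrative only, and the core URLs are expected to come from the cluster state as described above.

```python
import json
from typing import Callable
from urllib.parse import urlencode
from urllib.request import urlopen


def replication_url(core_url: str) -> str:
    """Build the replication-handler URL for one core."""
    query = urlencode({"command": "details", "wt": "json"})
    return core_url.rstrip("/") + "/replication?" + query


def fetch_generation(core_url: str) -> int:
    """Ask one core for its index generation (performs a real HTTP call)."""
    with urlopen(replication_url(core_url), timeout=10) as response:
        # Assumption: the field lives at details.generation in the response.
        return int(json.load(response)["details"]["generation"])


def follower_caught_up(leader_url: str, follower_url: str,
                       fetch: Callable[[str], int] = fetch_generation) -> bool:
    """Return True when the follower's generation is at least the leader's."""
    return fetch(follower_url) >= fetch(leader_url)
```

Passing `fetch` as a parameter keeps the comparison testable without a running cluster. Note that the leader can move ahead again between the two calls, so a True result only says the follower had caught up at that moment.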

Thanks,
Shawn

Re: Index dependent groups of data

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/7/2021 10:01 AM, lstusr 5u93n4 wrote:
> Seems like our experimentation is showing that it doesn't, at least for TLOG
> replica types. If we bound the query to the leaders, we can get accurate
> results immediately after the commit. If we don't add that restriction,
> the results sometimes won't show the groups of data that were
> indexed in the previous step.

Info you might already know:  TLOG (and PULL) replicas do not index, 
unless a TLOG replica is the leader, in which case it behaves exactly 
like NRT.  A PULL replica can never become leader.

When you have TLOG or PULL replicas, Solr is only going to do indexing 
on the shard leaders.  When a commit finishes, it should be done on all 
cores that participate in indexing.

Replication of the completed index segments to TLOG and PULL replicas 
will happen AFTER the commit is done, not concurrently.  I don't think 
there's a reliable way of asking Solr to tell you when all replications 
are complete.

If all replicas were NRT, then I think you wouldn't have this problem.  
But indexing is slower, because all replicas are going to do it, mostly 
concurrently.  In some cases the slowdown might be significant.
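If switching replica types is being considered, one way to do it is to add NRT replicas through the Collections API and remove the TLOG ones afterwards. The Python sketch below only builds the ADDREPLICA request URL and does not send it. The function name and the example host are hypothetical; `action=ADDREPLICA` with the `collection`, `shard` and `type` parameters is the part taken from the Collections API.

```python
from urllib.parse import urlencode

VALID_REPLICA_TYPES = {"nrt", "tlog", "pull"}


def add_replica_url(solr_base_url: str, collection: str, shard: str,
                    replica_type: str = "nrt") -> str:
    """Build a Collections API URL that adds one replica of the given type."""
    if replica_type not in VALID_REPLICA_TYPES:
        raise ValueError("unknown replica type: " + replica_type)
    params = urlencode({
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "type": replica_type,
    })
    # solr_base_url is expected to look like http://localhost:8983/solr
    return solr_base_url.rstrip("/") + "/admin/collections?" + params
```

Whether the indexing slowdown from NRT is acceptable is something only a test against real data can answer.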

Does your "query only the leaders" code check clusterstate in ZK to 
figure out which replicas are leader?  Leaders can change in response to 
problems.

Thanks,
Shawn

Re: Index dependent groups of data

Posted by lstusr 5u93n4 <ls...@gmail.com>.

> How long are you waiting between the hard commit and the query?
> Are you waiting for the commit operation to return a response before you
> try to query?

Well that's kind of the crux of the issue. We're issuing a hard commit
which (from what I've read) appears to be a synchronous operation. So, when
the call comes back with a 200 http response code, we can be assured that
the operation has gone through. But there's no artificial "wait time" after
that, because how are we to know how long that should be?
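Since no fixed wait time can be known in advance, a common alternative is to poll for a condition with an upper bound instead of sleeping for a guessed duration. The following Python sketch shows that pattern in general form. The name `wait_until` and its defaults are made up for illustration, and the `check` callable stands in for whatever test is appropriate, such as querying a replica for a known document.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool],
               timeout_s: float = 10.0,
               interval_s: float = 0.25,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        # Give up once the deadline has passed; check() has run at least once.
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The caller still has to decide what to do when this returns False, for example falling back to querying the leader.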

> I actually don't know whether a commit operation will wait for
> all replicas when you're in cloud mode.

Seems like our experimentation is showing that it doesn't, at least for TLOG
replica types. If we bound the query to the leaders, we can get accurate
results immediately after the commit. If we don't add that restriction,
the results sometimes won't show the groups of data that were
indexed in the previous step.

At this point we're proceeding with a strategy of only querying the leaders
during this operation... Seems to be working out so far.
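One way to express "query only the leaders" is Solr's `shards` request parameter, which takes a comma-separated list of core addresses to search. The Python sketch below builds such a query URL from a list of leader core URLs. It is an illustration under assumptions rather than the implementation used here: the function name is hypothetical, the leader URLs are assumed to have been looked up already, and the scheme is stripped because the documented examples of `shards` list addresses without one.

```python
from typing import Iterable
from urllib.parse import urlencode


def leader_only_query_url(entry_core_url: str,
                          leader_core_urls: Iterable[str],
                          query: str) -> str:
    """Build a /select URL whose distributed search visits only the given cores."""
    # Drop "http://" or "https://" so entries look like host:port/solr/core.
    shards = ",".join(url.split("://", 1)[-1].rstrip("/")
                      for url in leader_core_urls)
    params = urlencode({"q": query, "shards": shards})
    return entry_core_url.rstrip("/") + "/select?" + params
```

The list has to be rebuilt whenever leadership changes, since a stale list would silently send queries to a core that is no longer the leader.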

Thanks!

Kyle

Re: Index dependent groups of data

Posted by lstusr 5u93n4 <ls...@gmail.com>.

>  How about doing your queries against the leader only?

This seems to work. We haven't been able to produce an instance where the
primary data isn't there in the case where we bound the queries only to the
leaders.

> Solr is not transactional. You are assuming ACID properties,
> but Solr does not guarantee those, especially cluster-wide.

Yeah, understood. Trying to determine if there's a way we could
understand if a save + commit + query (optionally to leader) approaches a
"transaction", or if that's simply a non-starter given Solr's nature.

Kyle

On Tue, 7 Sept 2021 at 12:14, Walter Underwood <wu...@wunderwood.org>
wrote:

> How about doing your queries against the leader only?
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

Re: Index dependent groups of data

Posted by Walter Underwood <wu...@wunderwood.org>.

How about doing your queries against the leader only?

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 7, 2021, at 9:06 AM, lstusr 5u93n4 <ls...@gmail.com> wrote:
> 
>> Is there a particular reason for using TLOG replica types?
> 
> We used to use NRT replica types, but we switched to TLOG a year or two ago
> in order to prioritize indexing speed above all else, understanding that it
> might take a while for query results to be identical across replicas. This
> is the first time we've had a use case where we need to query immediately
> after indexing. Had we known then what we know now, maybe we wouldn't have
> switched... but that's hindsight I guess.
> 
> With an NRT replica type, do you know if we issue a commit does it apply to
> all replicas? We're not too far down the path that we couldn't switch back,
> and I assume that the effect would be minimized if we did so. However, I'd
> like to know that the issue would be completely GONE, not just reduced in
> frequency if we did switch back...
> 
> Thanks!
> 
> Kyle
> 
> On Fri, 3 Sept 2021 at 13:02, Nick Vladiceanu <vl...@gmail.com>
> wrote:
> 
>> Is there a particular reason for using TLOG replica types? For such a
>> small cluster and the scenario you’ve described it sounds more reasonable
>> to use NRT, that will (almost) guarantee that once you write your data -
>> it’ll be (almost) immediately available on all the nodes.
>> 
>> 
>>> On 3. Sep 2021, at 6:16 PM, Shawn Heisey <ap...@elyograg.org> wrote:
>>> 
>>> On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
>>>> What we're seeing is the following:
>>>> - index some data
>>>> - issue a hard commit
>>>> - issue a query for that data
>>>> - sometimes the query gets routed to a replica that is not yet updated,
>>>> and doesn't contain the data.
>>> 
>>> How long are you waiting between the hard commit and the query? Are you
>> waiting for the commit operation to return a response before you try to
>> query?  I actually don't know whether a commit operation will wait for all
>> replicas when you're in cloud mode.  I don't have a lot of experience with
>> SolrCloud yet.  I did set up a cloud deployment at an old job, but it was
>> VERY small.  All my large-index experience is in standalone mode.
>>> 
>>> Commits can sometimes be very slow.  This is mostly dependent on your
>> cache autowarm configuration and any manual warming queries that you have
>> defined.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>> 
>>

Re: Index dependent groups of data

Posted by lstusr 5u93n4 <ls...@gmail.com>.

>  Is there a particular reason for using TLOG replica types?

We used to use NRT replica types, but we switched to TLOG a year or two ago
in order to prioritize indexing speed above all else, understanding that it
might take a while for query results to be identical across replicas. This
is the first time we've had a use case where we need to query immediately
after indexing. Had we known then what we know now, maybe we wouldn't have
switched... but that's hindsight I guess.

With an NRT replica type, do you know whether a commit we issue applies to
all replicas? We're not too far down the path that we couldn't switch back,
and I assume that the effect would be minimized if we did so. However, I'd
like to know that the issue would be completely GONE, not just reduced in
frequency if we did switch back...

Thanks!

Kyle

Re: Index dependent groups of data

Posted by Nick Vladiceanu <vl...@gmail.com>.

Is there a particular reason for using TLOG replica types? For such a small cluster and the scenario you’ve described, it sounds more reasonable to use NRT, which will (almost) guarantee that once you write your data, it’ll be (almost) immediately available on all the nodes.


Re: Index dependent groups of data

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/3/2021 9:19 AM, lstusr 5u93n4 wrote:
> What we're seeing is the following:
>   - index some data
>   - issue a hard commit
>   - issue a query for that data
>   - sometimes the query gets routed to a replica that is not yet updated,
> and doesn't contain the data.

How long are you waiting between the hard commit and the query? Are you 
waiting for the commit operation to return a response before you try to 
query?  I actually don't know whether a commit operation will wait for 
all replicas when you're in cloud mode.  I don't have a lot of 
experience with SolrCloud yet.  I did set up a cloud deployment at an 
old job, but it was VERY small.  All my large-index experience is in 
standalone mode.

Commits can sometimes be very slow.  This is mostly dependent on your 
cache autowarm configuration and any manual warming queries that you 
have defined.

Thanks,
Shawn