Posted to solr-user@lucene.apache.org by Webster Homer <we...@sial.com> on 2017/05/22 20:41:57 UTC

solrcloud replicas not in sync

I have a solrcloud collection with 2 shards and 4 replicas. The replicas
for shard 1 have different numbers of records, so different queries will
return different numbers of records.

I am not certain how this occurred; it happened in a collection that was
a CDCR target.

Is there a way to limit a search to a specific replica of a shard? We want
to understand the differences.

Is there a way to recover when a shard has inconsistent replicas?
If I use the delete replica API call to delete one of them and then use
add replica to create it from scratch, will it auto-populate from the
other replica in the shard?
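
For concreteness, the calls I have in mind would look something like this
(a rough sketch against the Collections API; the collection, shard, and
replica names are placeholders):

    # Sketch: drop the suspect replica, then re-add it. The new replica
    # syncs its full index from the shard leader before becoming active.
    # All names below are placeholders.
    import requests

    COLLECTIONS_API = "http://localhost:8983/solr/admin/collections"

    requests.get(COLLECTIONS_API, params={
        "action": "DELETEREPLICA",
        "collection": "mycollection",
        "shard": "shard1",
        "replica": "core_node3",  # replica name as shown by CLUSTERSTATUS
    }).raise_for_status()

    requests.get(COLLECTIONS_API, params={
        "action": "ADDREPLICA",
        "collection": "mycollection",
        "shard": "shard1",
    }).raise_for_status()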

Thanks,
Webster


Re: solrcloud replicas not in sync

Posted by Webster Homer <we...@sial.com>.
Oh, those logs probably reflect the update job that runs every 15 minutes
if there are updates, typically 1 or 2 changes. Thanks for the info.

On Wed, May 24, 2017 at 10:37 AM, Erick Erickson <er...@gmail.com>
wrote:

> By default, enough closed log files will be kept to hold the last 100
> documents indexed. This is for "peer sync" purposes. Say replica1 goes
> offline for a bit. When it comes back online, if it's fallen behind by
> no more than 100 docs, the docs are replayed from another replica's
> tlog.
>
> Having such tiny tlogs is kind of unusual. My guess is that your
> ingestion rate is quite low. Every time a hard commit happens, a new
> tlog is opened up and the old one is closed. Having such tiny tlogs
> implies that you are getting one or a few documents per autocommit
> interval, so each tlog contains just a few docs. There's nothing wrong
> with that, mind you, so it's not a problem.
>
> When do log files get deleted? It Depends (tm). In the non-CDCR case,
> if the most recent N closed tlogs contain 100 or more documents, the
> tlogs older than N are deleted.
>
> In the CDCR case, the above condition must be true _and_ the docs in
> tlogs older than N must have been transmitted to the target cluster.
>
> Best,
> Erick
>
> On Wed, May 24, 2017 at 8:27 AM, Webster Homer <we...@sial.com>
> wrote:
> > The tlog sizes are strange.
> > In the case of the collection where we had issues with the replicas, the
> > tlog sizes are 740 bytes and 938 bytes on the target side, and the same on
> > the source side. There are a lot of them on the source side; when do tlog
> > files get deleted?
> >
> >
> >
> > On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> >> This is all quite strange. Optimize (BTW, it's rarely
> >> necessary/desirable on an index that changes, despite its name)
> >> shouldn't matter here. CDCR forwards the raw documents to the target
> >> cluster.
> >>
> >> Ample time indeed. With a soft commit of 15 seconds, that's your
> >> window (with some slop for how long CDCR takes).
> >>
> >> If you do a search and sort by your timestamp descending, what do you
> >> see on the target cluster? And when you are indexing and CDCR is
> >> running, your target cluster solr logs should show updates coming in.
> >> Mostly checking if the data is even getting to the target cluster
> >> here.
> >>
> >> Also check the tlogs on the source cluster. By "check" here I just
> >> mean "are they reasonable size", and "reasonable" should be very
> >> small. The tlogs are the "queue" that CDCR uses to store docs before
> >> forwarding to the target cluster, so this is just a sanity check. If
> >> they're huge, then CDCR is not forwarding anything to the target
> >> cluster.
> >>
> >> It's also vaguely possible that
> >> IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
> >> a bug and should be reported as a JIRA. If you remove that on the
> >> target cluster, does the behavior change?
> >>
> >> I'm mystified here as you can tell.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, May 23, 2017 at 10:12 AM, Webster Homer <webster.homer@sial.com
> >
> >> wrote:
> >> > We see a pretty consistent issue where the replicas show in the admin
> >> > console as not current, indicating that our auto commit isn't
> >> > committing. In one case we loaded the data to the source, CDCR
> >> > replicated it to the targets and we see the source and the target as
> >> > having current = false. It is searchable so the soft commits are
> >> > happening. We turned off data loading to investigate this issue, and
> >> > the replicas are still not current after 3 days. So there should have
> >> > been ample time to catch up.
> >> > This is our autoCommit
> >> >      <autoCommit>
> >> >        <maxDocs>25000</maxDocs>
> >> >        <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
> >> >        <openSearcher>false</openSearcher>
> >> >      </autoCommit>
> >> >
> >> > This is our autoSoftCommit
> >> >      <autoSoftCommit>
> >> >        <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> >> >      </autoSoftCommit>
> >> > Neither solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime is set.
> >> >
> >> > We also have an updateChain that calls the
> >> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> >> > commits. Could that be the cause of our problem?
> >> >       <updateRequestProcessorChain name="cleanup">
> >> >      <!-- Ignore commits from clients, telling them all's OK -->
> >> >        <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >> >          <int name="statusCode">200</int>
> >> >        </processor>
> >> >
> >> >        <processor class="TrimFieldUpdateProcessorFactory" />
> >> >        <processor class="RemoveBlankFieldUpdateProcessorFactory" />
> >> >
> >> >        <processor class="solr.LogUpdateProcessorFactory" />
> >> >        <processor class="solr.RunUpdateProcessorFactory" />
> >> >      </updateRequestProcessorChain>
> >> >
> >> > We added a date field to all our collections that defaults to NOW so I
> >> > can see that no new data was added, but the replicas don't seem to get
> >> > the commit. I assume this is something in our configuration (see above).
> >> >
> >> > Is there a way to determine when the last commit occurred?
> >> >
> >> > I believe that the one replica got out of sync due to an admin
> running an
> >> > optimize while cdcr was still running.
> >> > That was one collection, but it looks like we are missing commits on
> most
> >> > of our collections.
> >> >
> >> > Any help would be greatly appreciated!
> >> >
> >> > Thanks,
> >> > Webster Homer
> >> >
> >> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> You can ping individual replicas by addressing to a specific replica
> >> >> and setting distrib=false, something like
> >> >>
> >> >>      http://SOLR_NODE:port/solr/collection1_shard1_replica1/
> >> >> query?distrib=false&q=......
> >> >>
> >> >> But one thing to check first is that you've committed. I'd:
> >> >> 1> turn off indexing on the source cluster.
> >> >> 2> wait until the CDCR had caught up (if necessary).
> >> >> 3> issue a hard commit on the target
> >> >> 4> _then_ see if the counts were what is expected.
> >> >>
> >> >> Due to the fact that autocommit settings can fire at different clock
> >> >> times even for replicas on the same shard, it's easier to track
> >> >> whether it's a transient issue. The other thing I've seen people do
> is
> >> >> have a timestamp on the docs set to NOW (there's an update processor
> >> >> that can do this). Then when you check for consistency you can use
> >> >> fq=timestamp:[* TO NOW - (some interval significantly longer than
> your
> >> >> autocommit interval)].
> >> >>
> >> >> bq: Is there a way to recover when a shard has inconsistent replicas.
> >> >> If I use the delete replica API call to delete one of them and then
> use
> >> add
> >> >> replica to create it from scratch will it auto-populate from the
> other
> >> >> replica in the shard?
> >> >>
> >> >> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
> >> >> before becoming active. It'll have to copy the _entire_ index from
> the
> >> >> leader, so you'll see network traffic spike.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, May 22, 2017 at 1:41 PM, Webster Homer <
> webster.homer@sial.com>
> >> >> wrote:
> >> >> > I have a solrcloud collection with 2 shards and 4 replicas. The
> >> >> > replicas for shard 1 have different numbers of records, so different
> >> >> > queries will return different numbers of records.
> >> >> >
> >> >> > I am not certain how this occurred; it happened in a collection that
> >> >> > was a CDCR target.
> >> >> >
> >> >> > Is there a way to limit a search to a specific replica of a shard? We
> >> >> > want to understand the differences.
> >> >> >
> >> >> > Is there a way to recover when a shard has inconsistent replicas?
> >> >> > If I use the delete replica API call to delete one of them and then
> >> >> > use add replica to create it from scratch, will it auto-populate from
> >> >> > the other replica in the shard?
> >> >> >
> >> >> > Thanks,
> >> >> > Webster
> >> >> >
> >> >>
> >> >
> >>
> >
>


Re: solrcloud replicas not in sync

Posted by Erick Erickson <er...@gmail.com>.
By default, enough closed log files will be kept to hold the last 100
documents indexed. This is for "peer sync" purposes. Say replica1 goes
offline for a bit. When it comes back online, if it's fallen behind by
no more than 100 docs, the docs are replayed from another replica's
tlog.

Having such tiny tlogs is kind of unusual. My guess is that your
ingestion rate is quite low. Every time a hard commit happens, a new
tlog is opened up and the old one is closed. Having such tiny tlogs
implies that you are getting one or a few documents per autocommit
interval, so each tlog contains just a few docs. There's nothing wrong
with that, mind you, so it's not a problem.

When do log files get deleted? It Depends (tm). In the non-CDCR case,
if the most recent N closed tlogs contain 100 or more documents, the
tlogs older than N are deleted.

In the CDCR case, the above condition must be true _and_ the docs in
tlogs older than N must have been transmitted to the target cluster.
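
If you want to eyeball the tlogs quickly, something like this works (a
sketch; the data directory path is an assumption, adjust it for your
install):

    # Sketch: count the tlog files for one core and total their size.
    # The path below is an example; point it at your core's tlog dir.
    import os

    tlog_dir = "/var/solr/data/mycollection_shard1_replica1/data/tlog"
    files = sorted(os.listdir(tlog_dir))
    total = sum(os.path.getsize(os.path.join(tlog_dir, f)) for f in files)
    print("%d tlog files, %d bytes total" % (len(files), total))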

Best,
Erick

On Wed, May 24, 2017 at 8:27 AM, Webster Homer <we...@sial.com> wrote:
> The tlog sizes are strange.
> In the case of the collection where we had issues with the replicas, the
> tlog sizes are 740 bytes and 938 bytes on the target side, and the same on
> the source side. There are a lot of them on the source side; when do tlog
> files get deleted?
>
>
>
> On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <er...@gmail.com>
> wrote:
>
>> This is all quite strange. Optimize (BTW, it's rarely
>> necessary/desirable on an index that changes, despite its name)
>> shouldn't matter here. CDCR forwards the raw documents to the target
>> cluster.
>>
>> Ample time indeed. With a soft commit of 15 seconds, that's your
>> window (with some slop for how long CDCR takes).
>>
>> If you do a search and sort by your timestamp descending, what do you
>> see on the target cluster? And when you are indexing and CDCR is
>> running, your target cluster solr logs should show updates coming in.
>> Mostly checking if the data is even getting to the target cluster
>> here.
>>
>> Also check the tlogs on the source cluster. By "check" here I just
>> mean "are they reasonable size", and "reasonable" should be very
>> small. The tlogs are the "queue" that CDCR uses to store docs before
>> forwarding to the target cluster, so this is just a sanity check. If
>> they're huge, then CDCR is not forwarding anything to the target
>> cluster.
>>
>> It's also vaguely possible that
>> IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
>> a bug and should be reported as a JIRA. If you remove that on the
>> target cluster, does the behavior change?
>>
>> I'm mystified here as you can tell.
>>
>> Best,
>> Erick
>>
>> On Tue, May 23, 2017 at 10:12 AM, Webster Homer <we...@sial.com>
>> wrote:
>> > We see a pretty consistent issue where the replicas show in the admin
>> > console as not current, indicating that our auto commit isn't
>> > committing. In one case we loaded the data to the source, CDCR
>> > replicated it to the targets and we see the source and the target as
>> > having current = false. It is searchable so the soft commits are
>> > happening. We turned off data loading to investigate this issue, and
>> > the replicas are still not current after 3 days. So there should have
>> > been ample time to catch up.
>> > This is our autoCommit
>> >      <autoCommit>
>> >        <maxDocs>25000</maxDocs>
>> >        <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
>> >        <openSearcher>false</openSearcher>
>> >      </autoCommit>
>> >
>> > This is our autoSoftCommit
>> >      <autoSoftCommit>
>> >        <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
>> >      </autoSoftCommit>
>> > Neither solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime is set.
>> >
>> > We also have an updateChain that calls the
>> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
>> > commits. Could that be the cause of our problem?
>> >       <updateRequestProcessorChain name="cleanup">
>> >      <!-- Ignore commits from clients, telling them all's OK -->
>> >        <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
>> >          <int name="statusCode">200</int>
>> >        </processor>
>> >
>> >        <processor class="TrimFieldUpdateProcessorFactory" />
>> >        <processor class="RemoveBlankFieldUpdateProcessorFactory" />
>> >
>> >        <processor class="solr.LogUpdateProcessorFactory" />
>> >        <processor class="solr.RunUpdateProcessorFactory" />
>> >      </updateRequestProcessorChain>
>> >
>> > We added a date field to all our collections that defaults to NOW so I
>> > can see that no new data was added, but the replicas don't seem to get
>> > the commit. I assume this is something in our configuration (see above).
>> >
>> > Is there a way to determine when the last commit occurred?
>> >
>> > I believe that the one replica got out of sync due to an admin running an
>> > optimize while cdcr was still running.
>> > That was one collection, but it looks like we are missing commits on most
>> > of our collections.
>> >
>> > Any help would be greatly appreciated!
>> >
>> > Thanks,
>> > Webster Homer
>> >
>> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson <erickerickson@gmail.com
>> >
>> > wrote:
>> >
>> >> You can ping individual replicas by addressing to a specific replica
>> >> and setting distrib=false, something like
>> >>
>> >>      http://SOLR_NODE:port/solr/collection1_shard1_replica1/
>> >> query?distrib=false&q=......
>> >>
>> >> But one thing to check first is that you've committed. I'd:
>> >> 1> turn off indexing on the source cluster.
>> >> 2> wait until the CDCR had caught up (if necessary).
>> >> 3> issue a hard commit on the target
>> >> 4> _then_ see if the counts were what is expected.
>> >>
>> >> Due to the fact that autocommit settings can fire at different clock
>> >> times even for replicas on the same shard, it's easier to track
>> >> whether it's a transient issue. The other thing I've seen people do is
>> >> have a timestamp on the docs set to NOW (there's an update processor
>> >> that can do this). Then when you check for consistency you can use
>> >> fq=timestamp:[* TO NOW - (some interval significantly longer than your
>> >> autocommit interval)].
>> >>
>> >> bq: Is there a way to recover when a shard has inconsistent replicas.
>> >> If I use the delete replica API call to delete one of them and then use
>> add
>> >> replica to create it from scratch will it auto-populate from the other
>> >> replica in the shard?
>> >>
>> >> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
>> >> before becoming active. It'll have to copy the _entire_ index from the
>> >> leader, so you'll see network traffic spike.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, May 22, 2017 at 1:41 PM, Webster Homer <we...@sial.com>
>> >> wrote:
>> >> > I have a solrcloud collection with 2 shards and 4 replicas. The
>> >> > replicas for shard 1 have different numbers of records, so different
>> >> > queries will return different numbers of records.
>> >> >
>> >> > I am not certain how this occurred; it happened in a collection that
>> >> > was a CDCR target.
>> >> >
>> >> > Is there a way to limit a search to a specific replica of a shard? We
>> >> > want to understand the differences.
>> >> >
>> >> > Is there a way to recover when a shard has inconsistent replicas?
>> >> > If I use the delete replica API call to delete one of them and then
>> >> > use add replica to create it from scratch, will it auto-populate from
>> >> > the other replica in the shard?
>> >> >
>> >> > Thanks,
>> >> > Webster
>> >> >
>> >>
>> >
>>
>

Re: solrcloud replicas not in sync

Posted by Webster Homer <we...@sial.com>.
The tlog sizes are strange.
In the case of the collection where we had issues with the replicas, the
tlog sizes are 740 bytes and 938 bytes on the target side, and the same on
the source side. There are a lot of them on the source side; when do tlog
files get deleted?



On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <er...@gmail.com>
wrote:

> This is all quite strange. Optimize (BTW, it's rarely
> necessary/desirable on an index that changes, despite its name)
> shouldn't matter here. CDCR forwards the raw documents to the target
> cluster.
>
> Ample time indeed. With a soft commit of 15 seconds, that's your
> window (with some slop for how long CDCR takes).
>
> If you do a search and sort by your timestamp descending, what do you
> see on the target cluster? And when you are indexing and CDCR is
> running, your target cluster solr logs should show updates coming in.
> Mostly checking if the data is even getting to the target cluster
> here.
>
> Also check the tlogs on the source cluster. By "check" here I just
> mean "are they reasonable size", and "reasonable" should be very
> small. The tlogs are the "queue" that CDCR uses to store docs before
> forwarding to the target cluster, so this is just a sanity check. If
> they're huge, then CDCR is not forwarding anything to the target
> cluster.
>
> It's also vaguely possible that
> IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
> a bug and should be reported as a JIRA. If you remove that on the
> target cluster, does the behavior change?
>
> I'm mystified here as you can tell.
>
> Best,
> Erick
>
> On Tue, May 23, 2017 at 10:12 AM, Webster Homer <we...@sial.com>
> wrote:
> > We see a pretty consistent issue where the replicas show in the admin
> > console as not current, indicating that our auto commit isn't
> > committing. In one case we loaded the data to the source, CDCR
> > replicated it to the targets and we see the source and the target as
> > having current = false. It is searchable so the soft commits are
> > happening. We turned off data loading to investigate this issue, and
> > the replicas are still not current after 3 days. So there should have
> > been ample time to catch up.
> > This is our autoCommit
> >      <autoCommit>
> >        <maxDocs>25000</maxDocs>
> >        <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
> >        <openSearcher>false</openSearcher>
> >      </autoCommit>
> >
> > This is our autoSoftCommit
> >      <autoSoftCommit>
> >        <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> >      </autoSoftCommit>
> > Neither solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime is set.
> >
> > We also have an updateChain that calls the
> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> > commits. Could that be the cause of our problem?
> >       <updateRequestProcessorChain name="cleanup">
> >      <!-- Ignore commits from clients, telling them all's OK -->
> >        <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >          <int name="statusCode">200</int>
> >        </processor>
> >
> >        <processor class="TrimFieldUpdateProcessorFactory" />
> >        <processor class="RemoveBlankFieldUpdateProcessorFactory" />
> >
> >        <processor class="solr.LogUpdateProcessorFactory" />
> >        <processor class="solr.RunUpdateProcessorFactory" />
> >      </updateRequestProcessorChain>
> >
> > We added a date field to all our collections that defaults to NOW so I
> > can see that no new data was added, but the replicas don't seem to get
> > the commit. I assume this is something in our configuration (see above).
> >
> > Is there a way to determine when the last commit occurred?
> >
> > I believe that the one replica got out of sync due to an admin running an
> > optimize while cdcr was still running.
> > That was one collection, but it looks like we are missing commits on most
> > of our collections.
> >
> > Any help would be greatly appreciated!
> >
> > Thanks,
> > Webster Homer
> >
> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson <erickerickson@gmail.com
> >
> > wrote:
> >
> >> You can ping individual replicas by addressing to a specific replica
> >> and setting distrib=false, something like
> >>
> >>      http://SOLR_NODE:port/solr/collection1_shard1_replica1/
> >> query?distrib=false&q=......
> >>
> >> But one thing to check first is that you've committed. I'd:
> >> 1> turn off indexing on the source cluster.
> >> 2> wait until the CDCR had caught up (if necessary).
> >> 3> issue a hard commit on the target
> >> 4> _then_ see if the counts were what is expected.
> >>
> >> Due to the fact that autocommit settings can fire at different clock
> >> times even for replicas on the same shard, it's easier to track
> >> whether it's a transient issue. The other thing I've seen people do is
> >> have a timestamp on the docs set to NOW (there's an update processor
> >> that can do this). Then when you check for consistency you can use
> >> fq=timestamp:[* TO NOW - (some interval significantly longer than your
> >> autocommit interval)].
> >>
> >> bq: Is there a way to recover when a shard has inconsistent replicas.
> >> If I use the delete replica API call to delete one of them and then use
> add
> >> replica to create it from scratch will it auto-populate from the other
> >> replica in the shard?
> >>
> >> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
> >> before becoming active. It'll have to copy the _entire_ index from the
> >> leader, so you'll see network traffic spike.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, May 22, 2017 at 1:41 PM, Webster Homer <we...@sial.com>
> >> wrote:
> >> > I have a solrcloud collection with 2 shards and 4 replicas. The
> >> > replicas for shard 1 have different numbers of records, so different
> >> > queries will return different numbers of records.
> >> >
> >> > I am not certain how this occurred; it happened in a collection that
> >> > was a CDCR target.
> >> >
> >> > Is there a way to limit a search to a specific replica of a shard? We
> >> > want to understand the differences.
> >> >
> >> > Is there a way to recover when a shard has inconsistent replicas?
> >> > If I use the delete replica API call to delete one of them and then
> >> > use add replica to create it from scratch, will it auto-populate from
> >> > the other replica in the shard?
> >> >
> >> > Thanks,
> >> > Webster
> >> >
> >>
> >
>


Re: solrcloud replicas not in sync

Posted by Walter Underwood <wu...@wunderwood.org>.
Funny, I took a different approach to the same monitoring problem.

Each document has a published_timestamp field set when it is generated. The
schema has an indexed_timestamp field with a default of NOW. I wrote some
Python to get the set of nodes in the collection, query each one, then
report the freshness to Graphite. It is generally under 300 ms.
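
Roughly like this, a simplified sketch of the idea (node, collection, and
field names are examples; the Graphite reporting is left out):

    # Simplified sketch: ask each replica directly (distrib=false) for its
    # newest indexed_timestamp; the gap to "now" is the freshness.
    import requests

    replicas = [
        "http://node1:8983/solr/collection1_shard1_replica1",
        "http://node2:8983/solr/collection1_shard1_replica2",
    ]

    for url in replicas:
        resp = requests.get(url + "/select", params={
            "q": "*:*",
            "sort": "indexed_timestamp desc",
            "rows": 1,
            "fl": "indexed_timestamp",
            "distrib": "false",
            "wt": "json",
        }).json()
        docs = resp["response"]["docs"]
        print(url, docs[0]["indexed_timestamp"] if docs else "no docs")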

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On May 24, 2017, at 12:51 PM, Webster Homer <we...@sial.com> wrote:
> 
> Actually, I wrote a service that calls the Collections API CLUSTERSTATUS,
> but it adds data for each replica by calling the CoreAdmin STATUS API:
> https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS
>
> My service fills in the index information for each replica.
>
> This is what returns the current flag, and it may not always be correct?
> 
> On Wed, May 24, 2017 at 10:21 AM, Erick Erickson <er...@gmail.com>
> wrote:
> 
>> I wouldn't rely on the "current" flag in the admin UI as an indicator.
>> As long as your numDocs and the like match I'd say it's a UI issue.
>> 
>> Best,
>> Erick
>> 
>> On Wed, May 24, 2017 at 8:15 AM, Webster Homer <we...@sial.com>
>> wrote:
>>> We see data in the target clusters. CDCR replication is working. We first
>>> noticed the current=false flag on the target replicas, but since I started
>>> looking I see it on the source too.
>>> 
>>> 
>>> I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
>>> update processor chain, and I did two data loads to different collections.
>>> These collections are part of our development system; they are not
>>> configured to use CDCR, they are directly loaded by our data load. The ETL
>>> to our Solrs uses the /update/json request handler and does not send
>>> commits. These collections mirror our production collections and have 2
>>> shards with 2 replicas. I see the situation where the replicas are marked
>>> current=false, which should not happen if autoCommit were working
>>> correctly. The last load was yesterday at 5pm and I didn't check until
>>> this morning, when I found bb-catalog-material_shard1_replica1 (the
>>> leader) was not current, but the other was. The last modified date on the
>>> leader was 2017-05-23T22:44:54.618Z.
>>> 
>>> My modified autoCommit:
>>>      <autoCommit>
>>>       <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
>>>       <openSearcher>false</openSearcher>
>>>     </autoCommit>
>>> 
>>>     <autoSoftCommit>
>>>       <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
>>>     </autoSoftCommit>
>>> 
>>> The last indexed record from a search matches up with the above time. For
>>> this test, the numDocs are the same between the two replicas. I think the
>>> soft commit is working. Why wouldn't both replicas be current after so
>>> many hours?
>>> We are using Solr 6.2, FYI. I expect to upgrade to Solr 6.6 when it
>>> becomes available.
>>> 
>>> Thanks,
>>> Webster
>>> 
>>> On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <
>> erickerickson@gmail.com>
>>> wrote:
>>> 
>>>> This is all quite strange. Optimize (BTW, it's rarely
>>>> necessary/desirable on an index that changes, despite its name)
>>>> shouldn't matter here. CDCR forwards the raw documents to the target
>>>> cluster.
>>>> 
>>>> Ample time indeed. With a soft commit of 15 seconds, that's your
>>>> window (with some slop for how long CDCR takes).
>>>> 
>>>> If you do a search and sort by your timestamp descending, what do you
>>>> see on the target cluster? And when you are indexing and CDCR is
>>>> running, your target cluster solr logs should show updates coming in.
>>>> Mostly checking if the data is even getting to the target cluster
>>>> here.
>>>> 
>>>> Also check the tlogs on the source cluster. By "check" here I just
>>>> mean "are they reasonable size", and "reasonable" should be very
>>>> small. The tlogs are the "queue" that CDCR uses to store docs before
>>>> forwarding to the target cluster, so this is just a sanity check. If
>>>> they're huge, then CDCR is not forwarding anything to the target
>>>> cluster.
>>>> 
>>>> It's also vaguely possible that
>>>> IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
>>>> a bug and should be reported as a JIRA. If you remove that on the
>>>> target cluster, does the behavior change?
>>>> 
>>>> I'm mystified here as you can tell.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> On Tue, May 23, 2017 at 10:12 AM, Webster Homer <webster.homer@sial.com
>>> 
>>>> wrote:
>>>>> We see a pretty consistent issue where the replicas show in the admin
>>>>> console as not current, indicating that our auto commit isn't
>>>>> committing. In one case we loaded the data to the source, CDCR
>>>>> replicated it to the targets and we see the source and the target as
>>>>> having current = false. It is searchable so the soft commits are
>>>>> happening. We turned off data loading to investigate this issue, and
>>>>> the replicas are still not current after 3 days. So there should have
>>>>> been ample time to catch up.
>>>>> This is our autoCommit
>>>>>     <autoCommit>
>>>>>       <maxDocs>25000</maxDocs>
>>>>>       <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
>>>>>       <openSearcher>false</openSearcher>
>>>>>     </autoCommit>
>>>>> 
>>>>> This is our autoSoftCommit
>>>>>     <autoSoftCommit>
>>>>>       <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
>>>>>     </autoSoftCommit>
>>>>> Neither solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime is set.
>>>>> 
>>>>> We also have an updateChain that calls the
>>>>> solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
>>>>> commits. Could that be the cause of our problem?
>>>>>      <updateRequestProcessorChain name="cleanup">
>>>>>     <!-- Ignore commits from clients, telling them all's OK -->
>>>>>       <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
>>>>>         <int name="statusCode">200</int>
>>>>>       </processor>
>>>>> 
>>>>>       <processor class="TrimFieldUpdateProcessorFactory" />
>>>>>       <processor class="RemoveBlankFieldUpdateProcessorFactory" />
>>>>> 
>>>>>       <processor class="solr.LogUpdateProcessorFactory" />
>>>>>       <processor class="solr.RunUpdateProcessorFactory" />
>>>>>     </updateRequestProcessorChain>
>>>>> 
>>>>> We added a date field to all our collections that defaults to NOW so I
>>>>> can see that no new data was added, but the replicas don't seem to get
>>>>> the commit. I assume this is something in our configuration (see above).
>>>>> 
>>>>> Is there a way to determine when the last commit occurred?
>>>>> 
>>>>> I believe that the one replica got out of sync due to an admin
>> running an
>>>>> optimize while cdcr was still running.
>>>>> That was one collection, but it looks like we are missing commits on
>> most
>>>>> of our collections.
>>>>> 
>>>>> Any help would be greatly appreciated!
>>>>> 
>>>>> Thanks,
>>>>> Webster Homer
>>>>> 
>>>>> On Mon, May 22, 2017 at 4:12 PM, Erick Erickson <
>> erickerickson@gmail.com
>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> You can ping individual replicas by addressing to a specific replica
>>>>>> and setting distrib=false, something like
>>>>>> 
>>>>>>     http://SOLR_NODE:port/solr/collection1_shard1_replica1/
>>>>>> query?distrib=false&q=......
>>>>>> 
>>>>>> But one thing to check first is that you've committed. I'd:
>>>>>> 1> turn off indexing on the source cluster.
>>>>>> 2> wait until the CDCR had caught up (if necessary).
>>>>>> 3> issue a hard commit on the target
>>>>>> 4> _then_ see if the counts were what is expected.
>>>>>> 
>>>>>> Due to the fact that autocommit settings can fire at different clock
>>>>>> times even for replicas on the same shard, it's easier to track
>>>>>> whether it's a transient issue. The other thing I've seen people do
>> is
>>>>>> have a timestamp on the docs set to NOW (there's an update processor
>>>>>> that can do this). Then when you check for consistency you can use
>>>>>> fq=timestamp:[* TO NOW - (some interval significantly longer than
>> your
>>>>>> autocommit interval)].
>>>>>> 
>>>>>> bq: Is there a way to recover when a shard has inconsistent replicas.
>>>>>> If I use the delete replica API call to delete one of them and then
>> use
>>>> add
>>>>>> replica to create it from scratch will it auto-populate from the
>> other
>>>>>> replica in the shard?
>>>>>> 
>>>>>> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
>>>>>> before becoming active. It'll have to copy the _entire_ index from
>> the
>>>>>> leader, so you'll see network traffic spike.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> On Mon, May 22, 2017 at 1:41 PM, Webster Homer <
>> webster.homer@sial.com>
>>>>>> wrote:
>>>>>>> I have a solrcloud collection with 2 shards and 4 replicas. The
>>>>>>> replicas for shard 1 have different numbers of records, so different
>>>>>>> queries will return different numbers of records.
>>>>>>>
>>>>>>> I am not certain how this occurred; it happened in a collection that
>>>>>>> was a CDCR target.
>>>>>>>
>>>>>>> Is there a way to limit a search to a specific replica of a shard? We
>>>>>>> want to understand the differences.
>>>>>>>
>>>>>>> Is there a way to recover when a shard has inconsistent replicas?
>>>>>>> If I use the delete replica API call to delete one of them and then
>>>>>>> use add replica to create it from scratch, will it auto-populate from
>>>>>>> the other replica in the shard?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Webster
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 


Re: solrcloud replicas not in sync

Posted by Webster Homer <we...@sial.com>.
Actually, I wrote a service that calls the Collections API CLUSTERSTATUS,
but it adds data for each replica by calling the CoreAdmin STATUS API:
https://cwiki.apache.org/confluence/display/solr/CoreAdmin+API#CoreAdminAPI-STATUS

My service fills in the index information for each replica.

This is what returns the current flag, and it may not always be correct?
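
Roughly, the service does something like this (a simplified sketch; the
base URL and collection name are examples):

    # Sketch: CLUSTERSTATUS for the replica list, then CoreAdmin STATUS on
    # each replica's node to fill in the index details (current, numDocs,
    # lastModified).
    import requests

    SOLR = "http://localhost:8983/solr"
    cluster = requests.get(SOLR + "/admin/collections", params={
        "action": "CLUSTERSTATUS", "collection": "collection1", "wt": "json",
    }).json()

    shards = cluster["cluster"]["collections"]["collection1"]["shards"]
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            core = replica["core"]
            status = requests.get(replica["base_url"] + "/admin/cores",
                                  params={"action": "STATUS", "core": core,
                                          "wt": "json"}).json()
            index = status["status"][core]["index"]
            print(shard_name, core, "current=%s" % index["current"],
                  "numDocs=%s" % index["numDocs"], index.get("lastModified"))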

On Wed, May 24, 2017 at 10:21 AM, Erick Erickson <er...@gmail.com>
wrote:

> I wouldn't rely on the "current" flag in the admin UI as an indicator.
> As long as your numDocs and the like match I'd say it's a UI issue.
>
> Best,
> Erick
>
> On Wed, May 24, 2017 at 8:15 AM, Webster Homer <we...@sial.com>
> wrote:
> > We see data in the target clusters. CDCR replication is working. We first
> > noticed the current=false flag on the target replicas, but since I started
> > looking I see it on the source too.
> >
> >
> > I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
> > update processor chain, and I did two data loads to different collections.
> > These collections are part of our development system; they are not
> > configured to use CDCR, they are directly loaded by our data load. The ETL
> > to our Solrs uses the /update/json request handler and does not send
> > commits. These collections mirror our production collections and have 2
> > shards with 2 replicas. I see the situation where the replicas are marked
> > current=false, which should not happen if autoCommit were working
> > correctly. The last load was yesterday at 5pm and I didn't check until
> > this morning, when I found bb-catalog-material_shard1_replica1 (the
> > leader) was not current, but the other was. The last modified date on the
> > leader was 2017-05-23T22:44:54.618Z.
> >
> > My modified autoCommit:
> >       <autoCommit>
> >        <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
> >        <openSearcher>false</openSearcher>
> >      </autoCommit>
> >
> >      <autoSoftCommit>
> >        <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
> >      </autoSoftCommit>
> >
> > The last indexed record from a search matches up with the above time. For
> > this test, the numDocs are the same between the two replicas. I think the
> > soft commit is working. Why wouldn't both replicas be current after so
> > many hours?
> > We are using Solr 6.2, FYI. I expect to upgrade to Solr 6.6 when it
> > becomes available.
> >
> > Thanks,
> > Webster
> >
> > On Tue, May 23, 2017 at 12:52 PM, Erick Erickson <
> erickerickson@gmail.com>
> > wrote:
> >
> >> This is all quite strange. Optimize (BTW, it's rarely
> >> necessary/desirable on an index that changes, despite its name)
> >> shouldn't matter here. CDCR forwards the raw documents to the target
> >> cluster.
> >>
> >> Ample time indeed. With a soft commit of 15 seconds, that's your
> >> window (with some slop for how long CDCR takes).
> >>
> >> If you do a search and sort by your timestamp descending, what do you
> >> see on the target cluster? And when you are indexing and CDCR is
> >> running, your target cluster solr logs should show updates coming in.
> >> Mostly checking if the data is even getting to the target cluster
> >> here.
> >>
> >> Also check the tlogs on the source cluster. By "check" here I just
> >> mean "are they reasonable size", and "reasonable" should be very
> >> small. The tlogs are the "queue" that CDCR uses to store docs before
> >> forwarding to the target cluster, so this is just a sanity check. If
> >> they're huge, then CDCR is not forwarding anything to the target
> >> cluster.
> >>
> >> It's also vaguely possible that
> >> IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
> >> a bug and should be reported as a JIRA. If you remove that on the
> >> target cluster, does the behavior change?
> >>
> >> I'm mystified here as you can tell.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, May 23, 2017 at 10:12 AM, Webster Homer <webster.homer@sial.com
> >
> >> wrote:
> >> > We see a pretty consistent issue where the replicas show in the admin
> >> > console as not current, indicating that our auto commit isn't
> >> > committing. In one case we loaded the data to the source, CDCR
> >> > replicated it to the targets and we see the source and the target as
> >> > having current = false. It is searchable so the soft commits are
> >> > happening. We turned off data loading to investigate this issue, and
> >> > the replicas are still not current after 3 days. So there should have
> >> > been ample time to catch up.
> >> > This is our autoCommit
> >> >      <autoCommit>
> >> >        <maxDocs>25000</maxDocs>
> >> >        <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
> >> >        <openSearcher>false</openSearcher>
> >> >      </autoCommit>
> >> >
> >> > This is our autoSoftCommit
> >> >      <autoSoftCommit>
> >> >        <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
> >> >      </autoSoftCommit>
> >> > Neither solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime is set.
> >> >
> >> > We also have an updateChain that calls the
> >> > solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client
> >> > commits. Could that be the cause of our problem?
> >> >       <updateRequestProcessorChain name="cleanup">
> >> >      <!-- Ignore commits from clients, telling them all's OK -->
> >> >        <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
> >> >          <int name="statusCode">200</int>
> >> >        </processor>
> >> >
> >> >        <processor class="TrimFieldUpdateProcessorFactory" />
> >> >        <processor class="RemoveBlankFieldUpdateProcessorFactory" />
> >> >
> >> >        <processor class="solr.LogUpdateProcessorFactory" />
> >> >        <processor class="solr.RunUpdateProcessorFactory" />
> >> >      </updateRequestProcessorChain>
> >> >
> >> > We added a date field to all our collections that defaults to NOW so I
> >> > can see that no new data was added, but the replicas don't seem to get
> >> > the commit. I assume this is something in our configuration (see above).
> >> >
> >> > Is there a way to determine when the last commit occurred?
> >> >
> >> > I believe that the one replica got out of sync due to an admin
> running an
> >> > optimize while cdcr was still running.
> >> > That was one collection, but it looks like we are missing commits on
> most
> >> > of our collections.
> >> >
> >> > Any help would be greatly appreciated!
> >> >
> >> > Thanks,
> >> > Webster Homer
> >> >
> >> > On Mon, May 22, 2017 at 4:12 PM, Erick Erickson <
> erickerickson@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> You can ping individual replicas by addressing to a specific replica
> >> >> and setting distrib=false, something like
> >> >>
> >> >>      http://SOLR_NODE:port/solr/collection1_shard1_replica1/
> >> >> query?distrib=false&q=......
> >> >>
> >> >> But one thing to check first is that you've committed. I'd:
> >> >> 1> turn off indexing on the source cluster.
> >> >> 2> wait until the CDCR had caught up (if necessary).
> >> >> 3> issue a hard commit on the target
> >> >> 4> _then_ see if the counts were what is expected.
> >> >>
> >> >> Due to the fact that autocommit settings can fire at different clock
> >> >> times even for replicas on the same shard, it's easier to track
> >> >> whether it's a transient issue. The other thing I've seen people do
> is
> >> >> have a timestamp on the docs set to NOW (there's an update processor
> >> >> that can do this). Then when you check for consistency you can use
> >> >> fq=timestamp:[* TO NOW - (some interval significantly longer than
> your
> >> >> autocommit interval)].
> >> >>
> >> >> bq: Is there a way to recover when a shard has inconsistent replicas.
> >> >> If I use the delete replica API call to delete one of them and then
> use
> >> add
> >> >> replica to create it from scratch will it auto-populate from the
> other
> >> >> replica in the shard?
> >> >>
> >> >> Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
> >> >> before becoming active. It'll have to copy the _entire_ index from
> the
> >> >> leader, so you'll see network traffic spike.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Mon, May 22, 2017 at 1:41 PM, Webster Homer <
> webster.homer@sial.com>
> >> >> wrote:
> >> >> > I have a solrcloud collection with 2 shards and 4 replicas. The
> >> replicas
> >> >> > for shard 1 have different numbers of records, so different queries
> >> will
> >> >> > return different numbers of records.
> >> >> >
> >> >> > I am not certain how this occurred, it happened in a collection
> that
> >> was
> >> >> a
> >> >> > cdcr target.
> >> >> >
> >> >> > Is there a way to limit a search to a specific replica of a shard?
> We
> >> >> want
> >> >> > to understand the differences
> >> >> >
> >> >> > Is there a way to recover when a shard has inconsistent replicas.
> >> >> > If I use the delete replica API call to delete one of them and then
> >> use
> >> >> add
> >> >> > replica to create it from scratch will it auto-populate from the
> other
> >> >> > replica in the shard?
> >> >> >
> >> >> > Thanks,
> >> >> > Webster
> >> >> >
> >> >> > --
> >> >> >
> >> >> >
> >> >> > This message and any attachment are confidential and may be
> >> privileged or
> >> >> > otherwise protected from disclosure. If you are not the intended
> >> >> recipient,
> >> >> > you must not copy this message or attachment or disclose the
> contents
> >> to
> >> >> > any other person. If you have received this transmission in error,
> >> please
> >> >> > notify the sender immediately and delete the message and any
> >> attachment
> >> >> > from your system. Merck KGaA, Darmstadt, Germany and any of its
> >> >> > subsidiaries do not accept liability for any omissions or errors in
> >> this
> >> >> > message which may arise as a result of E-Mail-transmission or for
> >> damages
> >> >> > resulting from any unauthorized changes of the content of this
> message
> >> >> and
> >> >> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of
> its
> >> >> > subsidiaries do not guarantee that this message is free of viruses
> and
> >> >> does
> >> >> > not accept liability for any damages caused by any virus
> transmitted
> >> >> > therewith.
> >> >> >
> >> >> > Click http://www.emdgroup.com/disclaimer to access the German,
> >> French,
> >> >> > Spanish and Portuguese versions of this disclaimer.
> >> >>
> >> >
> >> > --
> >> >
> >> >
> >> > This message and any attachment are confidential and may be
> privileged or
> >> > otherwise protected from disclosure. If you are not the intended
> >> recipient,
> >> > you must not copy this message or attachment or disclose the contents
> to
> >> > any other person. If you have received this transmission in error,
> please
> >> > notify the sender immediately and delete the message and any
> attachment
> >> > from your system. Merck KGaA, Darmstadt, Germany and any of its
> >> > subsidiaries do not accept liability for any omissions or errors in
> this
> >> > message which may arise as a result of E-Mail-transmission or for
> damages
> >> > resulting from any unauthorized changes of the content of this message
> >> and
> >> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> >> > subsidiaries do not guarantee that this message is free of viruses and
> >> does
> >> > not accept liability for any damages caused by any virus transmitted
> >> > therewith.
> >> >
> >> > Click http://www.emdgroup.com/disclaimer to access the German,
> French,
> >> > Spanish and Portuguese versions of this disclaimer.
> >>
> >
> > --
> >
> >
> > This message and any attachment are confidential and may be privileged or
> > otherwise protected from disclosure. If you are not the intended
> recipient,
> > you must not copy this message or attachment or disclose the contents to
> > any other person. If you have received this transmission in error, please
> > notify the sender immediately and delete the message and any attachment
> > from your system. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not accept liability for any omissions or errors in this
> > message which may arise as a result of E-Mail-transmission or for damages
> > resulting from any unauthorized changes of the content of this message
> and
> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
> > subsidiaries do not guarantee that this message is free of viruses and
> does
> > not accept liability for any damages caused by any virus transmitted
> > therewith.
> >
> > Click http://www.emdgroup.com/disclaimer to access the German, French,
> > Spanish and Portuguese versions of this disclaimer.
>

-- 


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.

Re: solrcloud replicas not in sync

Posted by Erick Erickson <er...@gmail.com>.
I wouldn't rely on the "current" flag in the admin UI as an indicator.
As long as your numDocs and the like match I'd say it's a UI issue.
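
A quick way to check that outside the UI is the Luke request handler,
which reports per-core index stats including numDocs, the current flag,
and the last time the index was modified (effectively the last hard
commit). A minimal sketch, assuming a node on localhost:8983 and the
core name from your message:

     # Hypothetical host, port and core name -- substitute your own.
     curl "http://localhost:8983/solr/bb-catalog-material_shard1_replica1/admin/luke?numTerms=0&wt=json"
     # Inspect index.numDocs, index.current and index.lastModified.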

Best,
Erick

Re: solrcloud replicas not in sync

Posted by Webster Homer <we...@sial.com>.
We see data in the target clusters. CDCR replication is working. We first
noticed the current=false flag on the target replicas, but since I started
looking I see it on the source too.


I have removed the IgnoreCommitOptimizeUpdateProcessorFactory from our
update processor chain and did two data loads to different collections.
These collections are part of our development system; they are not
configured to use cdcr and are loaded directly by our data load. The ETL
to our Solr clusters uses the /update/json request handler and does not
send commits. These collections mirror our production collections and
have 2 shards with 2 replicas each. I still see replicas marked
current=false, which should not happen if autoCommit were working
correctly. The last load was yesterday at 5pm, and when I checked this
morning I found bb-catalog-material_shard1_replica1 (the leader) was not
current, but the other replica was. The last modified date on the leader
was 2017-05-23T22:44:54.618Z.

My modified autoCommit:
      <autoCommit>
       <maxTime>${solr.autoCommit.maxTime:600000}</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

     <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:60000}</maxTime>
     </autoSoftCommit>
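
For what it's worth, those ${...} defaults can be overridden per node
with system properties; a sketch, assuming a standard solr.in.sh and
the same values as above (both nodes need a restart to pick this up):

     SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=600000"
     SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=60000"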

The last indexed record from a search matches up with the above time. For
this test, the numDocs are the same between the two replicas, so I think
the soft commit is working. Why wouldn't both replicas be current after
so many hours?
We are using Solr 6.2, FYI; I expect to upgrade to Solr 6.6 when it
becomes available.
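
The "last indexed record" check is just a descending sort on our date
field; a sketch, with hypothetical collection and field names:

     curl "http://localhost:8983/solr/bb-catalog-material/select?q=*:*&sort=last_modified+desc&rows=1&fl=last_modified&wt=json"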

Thanks,
Webster

Re: solrcloud replicas not in sync

Posted by Erick Erickson <er...@gmail.com>.
This is all quite strange. Optimize (BTW, it's rarely
necessary/desirable on an index that changes, despite its name)
shouldn't matter here. CDCR forwards the raw documents to the target
cluster.

Ample time indeed. With a soft commit of 15 seconds, that's your
window (with some slop for how long CDCR takes).

If you do a search and sort by your timestamp descending, what do you
see on the target cluster? And when you are indexing and CDCR is
running, your target cluster solr logs should show updates coming in.
Mostly checking if the data is even getting to the target cluster
here.

Also check the tlogs on the source cluster. By "check" here I just
mean "are they reasonable size", and "reasonable" should be very
small. The tlogs are the "queue" that CDCR uses to store docs before
forwarding to the target cluster, so this is just a sanity check. If
they're huge, then CDCR is not forwarding anything to the target
cluster.
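
A quick way to eyeball the tlog sizes, assuming a default on-disk core
layout (the path here is hypothetical and varies by install):

     # Each replica core keeps its transaction logs under <core>/data/tlog.
     ls -lh /var/solr/data/bb-catalog-material_shard1_replica1/data/tlog/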

It's also vaguely possible that
IgnoreCommitOptimizeUpdateProcessorFactory is interfering; if so, it's
a bug and should be reported as a JIRA. If you remove that on the
target cluster, does the behavior change?

I'm mystified here as you can tell.

Best,
Erick

Re: solrcloud replicas not in sync

Posted by Webster Homer <we...@sial.com>.
We see a pretty consistent issue where the replicas show in the admin
console as not current, indicating that our auto commit isn't committing.
In one case we loaded the data to the source, cdcr replicated it to the
targets, and we see both the source and the target as having current =
false. The data is searchable, so the soft commits are happening. We
turned off data loading to investigate this issue, and the replicas are
still not current after 3 days, so there should have been ample time to
catch up.
This is our autoCommit
     <autoCommit>
       <maxDocs>25000</maxDocs>
       <maxTime>${solr.autoCommit.maxTime:300000}</maxTime>
       <openSearcher>false</openSearcher>
     </autoCommit>

This is our autoSoftCommit
     <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:15000}</maxTime>
     </autoSoftCommit>
Neither property (solr.autoCommit.maxTime nor solr.autoSoftCommit.maxTime)
is set, so the defaults above apply.

We also have an updateChain that calls
solr.IgnoreCommitOptimizeUpdateProcessorFactory to ignore client commits.
Could that be the cause of our problem? Here is the chain:
      <updateRequestProcessorChain name="cleanup">
     <!-- Ignore commits from clients, telling them all's OK -->
       <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
         <int name="statusCode">200</int>
       </processor>

       <processor class="TrimFieldUpdateProcessorFactory" />
       <processor class="RemoveBlankFieldUpdateProcessorFactory" />

       <processor class="solr.LogUpdateProcessorFactory" />
       <processor class="solr.RunUpdateProcessorFactory" />
     </updateRequestProcessorChain>
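
If I understand that processor correctly, with statusCode 200 a client
commit is acknowledged but never executed, e.g. (hypothetical host and
collection):

     # Returns HTTP 200, but the commit itself is swallowed by the chain.
     curl "http://localhost:8983/solr/bb-catalog-material/update?commit=true"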

We did add a date field to all our collections that defaults to NOW, so I
can see that no new data was added; but the replicas don't seem to get
the commit. I assume this is something in our configuration (see above).
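
For reference, that field is just a date with a NOW default; something
like this in the schema (field and type names assumed):

     <field name="last_modified" type="date" indexed="true" stored="true" default="NOW"/>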

Is there a way to determine when the last commit occurred?

I believe that the one replica got out of sync due to an admin running an
optimize while cdcr was still running.
That was one collection, but it looks like we are missing commits on most
of our collections.

Any help would be greatly appreciated!

Thanks,
Webster Homer

Re: solrcloud replicas not in sync

Posted by Erick Erickson <er...@gmail.com>.
You can ping individual replicas by addressing a specific replica's core
directly and setting distrib=false, something like

     http://SOLR_NODE:port/solr/collection1_shard1_replica1/query?distrib=false&q=......
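
For example, to compare raw doc counts across the two replicas of a
shard (node and core names here are placeholders):

     curl "http://solr-node1:8983/solr/collection1_shard1_replica1/select?q=*:*&rows=0&distrib=false&wt=json"
     curl "http://solr-node2:8983/solr/collection1_shard1_replica2/select?q=*:*&rows=0&distrib=false&wt=json"
     # If numFound differs between the two, the replicas really are out of sync.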

But one thing to check first is that you've committed. I'd:
1> turn off indexing on the source cluster.
2> wait until CDCR has caught up (if necessary).
3> issue a hard commit on the target (see the sketch below).
4> _then_ see if the counts are what you expect.
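
For step 3, a hard commit is just an update request with commit=true; a
sketch against a hypothetical target node and collection:

     curl "http://target-node:8983/solr/collection1/update?commit=true&waitSearcher=true"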

Because autocommit settings can fire at different wall-clock times even
for replicas of the same shard, counts can differ transiently, and you
need a way to tell a transient difference from a real one. The other
thing I've seen people do is have a timestamp on the docs set to NOW
(there's an update processor that can do this; see the sketch below).
Then when you check for consistency you can use
fq=timestamp:[* TO NOW - (some interval significantly longer than your
autocommit interval)].
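
That processor is TimestampUpdateProcessorFactory; a minimal sketch of
the chain entry plus the matching filter, with an assumed field name:

     <processor class="solr.TimestampUpdateProcessorFactory">
       <str name="fieldName">timestamp</str>
     </processor>

and then at query time, for example:

     fq=timestamp:[* TO NOW-1HOUR]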

bq: Is there a way to recover when a shard has inconsistent replicas.
If I use the delete replica API call to delete one of them and then use add
replica to create it from scratch will it auto-populate from the other
replica in the shard?

Yes. Whenever you ADDREPLICA it'll catch itself up from the leader
before becoming active. It'll have to copy the _entire_ index from the
leader, so you'll see network traffic spike.
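
The Collections API calls for that look roughly like this (collection,
shard, replica and node names are placeholders -- CLUSTERSTATUS will
show you the real core_node names):

     curl "http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=collection1&shard=shard1&replica=core_node3"
     curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=collection1&shard=shard1&node=solr-node2:8983_solr"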

Best,
Erick
