Posted to solr-user@lucene.apache.org by Gili Nachum <gi...@gmail.com> on 2015/09/21 08:34:51 UTC

How can I get a monotonically increasing field value for docs?

I've implemented a custom solr2solr ongoing unidirectional replication
mechanism.

A Replicator (acting as a SolrJ client) crawls documents from SolrCloud1 and
writes them to SolrCloud2 in batches.
The replicator's crawl logic is to read documents with a time greater than or
equal to the time of the last replicated document.
Whenever a document is added or updated, I automatically update a tdate field,
"last_updated_in_solr", using TimestampUpdateProcessorFactory.

*My problem:* When a client indexes a batch of 100 documents, all 100 docs
have the same "last_updated_in_solr" value. This makes my ongoing
replication check for new documents much more complex than it would be if
the time value were unique.

1. Can I use some other processor to generate increasing unique values?
2. Can I use the internal _version_ field for this? Is it guaranteed to be
monotonically increasing for the entire collection, or only per document
with each add/update?
Any other options?

schema.xml:
<field name="last_updated_in_solr" type="tdate" indexed="true"
       stored="true" multiValued="false"/>

solrconfig.xml:
<updateRequestProcessorChain name="default">
    <processor class="solr.TimestampUpdateProcessorFactory">
        <str name="fieldName">last_updated_in_solr</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

I know there's work underway on a built-in replication mechanism, but it's
not yet released.
I'm using Solr 4.7.2.

Re: How can I get a monotonically increasing field value for docs?

Posted by Gili Nachum <gi...@gmail.com>.

Glad I made that silly statement.
I came across cursorMark after noticing how inefficient native deep paging
is in Solr, where each shard returns start+rows worth of data to
the shard servicing the query. I then *wrongly* assumed that cursorMark
records the position of the last returned doc in the result set of *each
shard*, so that on the next request each shard would return the next rows'
worth of documents from where it previously left off.

I now see that the cursorMark value encodes the sort field values of the
last returned document, so that on the next request each shard fetches
documents after that point (with Lucene's searchAfter()), just like in my
own custom implementation.

Thanks for clarifying.

Re: How can I get a monotonically increasing field value for docs?

Posted by Chris Hostetter <ho...@fucit.org>.

: Small potato: I assume cursor mark breaks when the number of shards changes
: while keeping the original values doesn't, since the relative position is
: encoded per shard...But that's an edge case.

I don't understand your question ... the encoded cursorMark values don't
know or care anything about shards.  A cursorMark only encodes
information about the *relative* position where you left off according to
the specified sort -- that position is relative to the abstract ordering
of all possible values, not relative to any particular shard(s).

in your use case it would function *exactly* the same as keeping track of
the exact timestamp and uniqueKey of the last doc you received, and
passing that cursorMark value back on the next query would be exactly the
same as specifying a "fq=timestamp:{X TO *] OR (timestamp:X AND id:{Y TO
*])" on the next request, except that under the covers the
cursorMark is passed down to the IndexSearcher as a "searchAfter"
structure, which should be more efficient than using an fq.
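
To make that equivalence concrete, here is a minimal sketch of how the hand-built filter could be assembled. The class and method names are hypothetical, the field names timestamp and id are taken from the example above, and the values are assumed to need no escaping beyond quoting the timestamp in the equality clause. The lower bound on id is exclusive in this sketch so that the last doc received is not returned again.

```java
// Hypothetical helper, not part of Solr or SolrJ.
final class ResumeFilter {

    /**
     * Builds a filter query matching docs that sort strictly after
     * (lastTimestamp, lastId) under "sort=timestamp asc,id asc".
     * Assumes both values are safe to embed in a Solr query as-is.
     */
    static String after(String lastTimestamp, String lastId) {
        return "timestamp:{" + lastTimestamp + " TO *]"
             + " OR (timestamp:\"" + lastTimestamp + "\""
             + " AND id:{" + lastId + " TO *])";
    }

    public static void main(String[] args) {
        // The values below are made up for illustration.
        System.out.println(after("2015-09-21T08:34:51Z", "doc-42"));
    }
}
```

A cursorMark carries the same (timestamp, id) position, so the client does not have to build or escape this string itself.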

adding shards, removing shards, adding documents, removing documents ...
cursorMark doesn't care ... what you get back is any doc that, at the
moment you send that cursorMark value, has sort values which would place
that doc *after* the last doc you received with the previous request, when
you got that value as the nextCursorMark.

changing the value of a sort field in a document in the middle of
iteration might affect whether it is ever seen, or whether it's seen more
than once (see previously mentioned URL for detailed examples), but
splitting shards or whatnot is not going to affect the results of
iterating a cursor in any way.


-Hoss
http://www.lucidworks.com/

Re: How can I get a monotonically increasing field value for docs?

Posted by Gili Nachum <gi...@gmail.com>.

Hoss,

Good point, didn't know about cursor mark when we designed this a year ago
:(

Small potato: I assume cursor mark breaks when the number of shards changes
while keeping the original values doesn't, since the relative position is
encoded per shard...But that's an edge case.

Looking forward to http://yonik.com/solr-cross-data-center-replication/

Re: How can I get a monotonically increasing field value for docs?

Posted by Chris Hostetter <ho...@fucit.org>.


You're basically re-implementing Solr's cursors.

you can change your system of reading docs from the old collection to 
use...

cursorMark=*&sort=timestamp+asc,id+asc

...and then instead of keeping track of the last timestamp & id values and 
constructing a filter, you can just keep track of the nextCursorMark and 
pass it the next time you want to check for newer documents...

https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
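
As a rough sketch of what the replicator would send on each poll, the request parameters could be built like this. The class name, the match-all query and the rows value are assumptions made for the example; only cursorMark and the timestamp/id sort come from the suggestion above. It uses just the JDK, so the same parameters would still have to be handed to SolrJ or an HTTP client of your choice.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of SolrJ.
final class CursorRequest {

    /** Cursor value that starts a new iteration. */
    static final String START = "*";

    /** Builds the URL-encoded query string for one cursor request. */
    static String params(String cursorMark, int rows) {
        return "q=" + enc("*:*")
             + "&sort=" + enc("timestamp asc,id asc")
             + "&rows=" + rows
             + "&cursorMark=" + enc(cursorMark);
    }

    private static String enc(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First request of a crawl; later requests pass the stored nextCursorMark.
        System.out.println(params(START, 100));
    }
}
```

After each response, the replicator would store the returned nextCursorMark and pass it as cursorMark on the next poll. If the value comes back unchanged, there were no newer documents.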





-Hoss
http://www.lucidworks.com/

Re: How can I get a monotonically increasing field value for docs?

Posted by Gili Nachum <gi...@gmail.com>.

Thanks for the in-depth explanation!

The secondary sort by uuid would allow me to read a series of docs with
identical time over multiple batches by filtering on
time>timeOnLastReadDoc or (time=timeOnLastReadDoc and
uuid>uuidOnLastReadDoc), which essentially creates a unique sorted value to
track progress over.
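
The filtering described above can be sketched against an in-memory list to check that docs sharing a timestamp are neither skipped nor repeated across batches. Everything here (class names, the Doc type, the batch size) is hypothetical and only models the comparison logic; it does not talk to Solr.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory model of the crawl, for illustration only.
final class KeysetCrawl {

    /** Minimal stand-in for a stored document: update time plus a unique id. */
    static final class Doc {
        final long time;
        final String uuid;

        Doc(long time, String uuid) {
            this.time = time;
            this.uuid = uuid;
        }
    }

    /** Primary sort on time, secondary sort on uuid to break ties. */
    static final Comparator<Doc> ORDER =
        Comparator.comparingLong((Doc d) -> d.time).thenComparing(d -> d.uuid);

    /** Returns up to 'rows' docs sorting strictly after 'last' (null means start from the beginning). */
    static List<Doc> nextBatch(List<Doc> index, Doc last, int rows) {
        List<Doc> sorted = new ArrayList<>(index);
        sorted.sort(ORDER);
        List<Doc> batch = new ArrayList<>();
        for (Doc d : sorted) {
            if (batch.size() == rows) {
                break;
            }
            // time > lastTime OR (time == lastTime AND uuid > lastUuid)
            boolean after = last == null
                || d.time > last.time
                || (d.time == last.time && d.uuid.compareTo(last.uuid) > 0);
            if (after) {
                batch.add(d);
            }
        }
        return batch;
    }

    public static void main(String[] args) {
        List<Doc> index = new ArrayList<>();
        for (String u : new String[] {"d", "b", "a", "c"}) {
            index.add(new Doc(100L, u)); // four docs indexed in the same batch
        }
        index.add(new Doc(200L, "e"));

        Doc last = null;
        List<Doc> batch;
        while (!(batch = nextBatch(index, last, 2)).isEmpty()) {
            for (Doc d : batch) {
                System.out.println(d.time + " " + d.uuid);
            }
            last = batch.get(batch.size() - 1); // remember the resume position
        }
    }
}
```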

Re: How can I get a monotonically increasing field value for docs?

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/21/2015 9:01 AM, Gili Nachum wrote:
> TimestampUpdateProcessorFactory takes place only on the leader shard, or on
> each shard replica?
> if on each replica then I would get different values on each replica.
>
> My alternative would be to perform secondary sort on a UUID to ensure order.

If the update chain is configured properly, it runs on the leader, so
all replicas get the same timestamp.

Without SolrCloud, the way to create an "indexed at" time field is in
the schema -- specify a default value of NOW on the field definition and
don't send the field when indexing.  The old master/slave replication
copies the actual index contents, so the indexed values in all replicas
are the same.

The problem with NOW in the schema when running SolrCloud is that each
replica indexes the document independently, so each replica can have a
different timestamp.  This is why the timestamp update processor exists
-- to set the timestamp to a specific value before the document is
duplicated to each replica, eliminating the problem.

FYI, secondary sort parameters only affect the order when the primary sort
field is identical between two documents.  It may not do what you are
intending because of that.

Thanks,
Shawn

Re: How can I get a monotonically increasing field value for docs?

Posted by Gili Nachum <gi...@gmail.com>.

Does TimestampUpdateProcessorFactory run only on the shard leader, or on
each shard replica?
If on each replica, then I would get different values on each replica.

My alternative would be to perform a secondary sort on a UUID to ensure order.
Thanks.

Re: How can I get a monotonically increasing field value for docs?

Posted by Shawn Heisey <ap...@elyograg.org>.

On 9/21/2015 3:09 AM, Upayavira wrote:
> Effectively, all it does is return the value of NOW according to the
> request, as the default value.
> 
> You could construct that on a per invocation basis, using
> System.currentTimeMillis() or whatever.

The millisecond timestamp isn't guaranteed to always increase on every
call -- it's not monotonic.

http://stackoverflow.com/a/2979239/2665648

If the OS and hardware are capable of doing it, System.nanoTime() IS
monotonic, and MIGHT be updated more frequently.
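
One way to get strictly increasing values even when the clock repeats or steps backwards is to remember the last value handed out and bump it when needed. The sketch below is only an illustration under a big assumption: all values come from a single JVM. With several shard leaders on different nodes it would not give a collection-wide order. The class name is hypothetical, and the values can run slightly ahead of the real clock under a high update rate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical generator; only monotonic within one JVM.
final class MonotonicStamp {

    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    /** Returns the current wall-clock millis, or last + 1 if the clock has not advanced. */
    long next() {
        return next(System.currentTimeMillis());
    }

    /** Same as next(), with the clock reading passed in so the logic can be exercised directly. */
    long next(long now) {
        return last.updateAndGet(prev -> Math.max(prev + 1, now));
    }

    public static void main(String[] args) {
        MonotonicStamp stamp = new MonotonicStamp();
        // Calls made within the same millisecond still get distinct, increasing values.
        for (int i = 0; i < 5; i++) {
            System.out.println(stamp.next());
        }
    }
}
```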

Thanks,
Shawn

Re: How can I get a monotonically increasing field value for docs?

Posted by Upayavira <uv...@odoko.co.uk>.

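There's nothing to stop you creating your own
TimestampUpdateProcessorFactory. Here's the entire source for it:

package org.apache.solr.update.processor;

import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrRequestInfo;
import org.apache.solr.response.SolrQueryResponse;

public class TimestampUpdateProcessorFactory
  extends AbstractDefaultValueUpdateProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    // Fills in fieldName only when the incoming document has no value for it.
    return new DefaultValueUpdateProcessor(fieldName, next) {
      @Override
      public Object getDefaultValue() {
        // NOW is fixed once per request, so every doc in a batch gets the same value.
        return SolrRequestInfo.getRequestInfo().getNOW();
      }
    };
  }
}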

Effectively, all it does is return the value of NOW according to the
request, as the default value.

You could compute the value on a per-invocation basis instead, using
System.currentTimeMillis() or whatever.

Upayavira
