Posted to solr-user@lucene.apache.org by "onlinespending@gmail.com" <on...@gmail.com> on 2011/03/15 04:38:48 UTC

keeping data consistent between Database and Solr

Like many people, I don't use Solr as my primary data store. Not all of my
data needs to be searchable, and for simple, fast retrieval I store it in a
database (Cassandra in my case).  Actually I don't have this all built yet,
but my intention is that whenever new data is entered, it is added to my
Cassandra database and simultaneously added to the Solr index (either by
queuing up recent data before a commit or by some other means; any
suggestions on this front?).

But my main question is, how do I guarantee that data between my Cassandra
database and Solr index are consistent and up-to-date?  What if I write the
data to Cassandra and then a failure occurs during the commit to the Solr
index?  I would need to be aware of what data failed to commit and make sure
that a re-attempt is made.  Obviously inconsistency for a short duration is
inevitable when using two different databases (Cassandra and Solr), but I
certainly don't want a failure to create perpetual inconsistency.  I'm
curious what sort of mechanisms people are using to ensure consistency
between their database (MySQL, Cassandra, etc.) and Solr.
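The mechanism described above (write to the database, write to Solr, and remember anything that failed to index for a later re-attempt) can be sketched roughly as follows. This is an illustrative sketch only; `db_write`, `solr_add`, and the queue-based retry are assumptions, not an established pattern from this thread.

```python
# Hypothetical sketch: write to the primary store first, then to Solr.
# Failed Solr writes land in a retry queue so a transient indexing
# failure cannot turn into perpetual inconsistency.
from collections import deque

class DualWriter:
    def __init__(self, db_write, solr_add):
        self.db_write = db_write      # e.g. a Cassandra insert
        self.solr_add = solr_add      # e.g. a Solr add + commit
        self.retry_queue = deque()    # documents that failed to index

    def write(self, doc):
        self.db_write(doc)            # the database is the source of truth
        try:
            self.solr_add(doc)
        except Exception:
            self.retry_queue.append(doc)      # remember it for a re-attempt

    def retry_failed(self):
        # Re-attempt every queued document once; still-failing docs stay queued.
        for _ in range(len(self.retry_queue)):
            doc = self.retry_queue.popleft()
            try:
                self.solr_add(doc)
            except Exception:
                self.retry_queue.append(doc)
```

A periodic job (cron, background thread) would call `retry_failed()` until the queue drains; documents re-added to Solr are simply overwritten via the uniqueKey, so retries are harmless.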

Thank you,
Ben

RE: keeping data consistent between Database and Solr

Posted by Tim Gilbert <TI...@morningstar.com>.
I use Solr + MySQL with data coming from several DIH-style "loaders" that
I have written to move data from many different databases into my "BI"
solution.  I don't use DIH itself because I am not simply replicating the
data; I am moving/merging/processing the incoming data during loading.

For me, I have an Aspect (aspectj) which wraps my Data Access Object and
every time a "persist" is called (I am using hibernate), I update Solr
with the same data an instant later using @Around advice.  This handles
nearly every event during the day.  I have a simple "retry" procedure on
my Solrj add/commit on network error in hopes that it will eventually
work.

To catch any errors, I rebuild the Solr index from scratch each night
from the data in MySQL; the rebuild takes about 10 minutes.  This gives
me "eventual consistency" for any issues that cropped up during the day.

Obviously the size of my database (< 2 million records) makes this
approach manageable.  YMMV.
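The retry-on-network-error step Tim mentions can be sketched as a small wrapper around the Solr call. The attempt count and re-raising after exhaustion are assumptions; in Tim's setup the nightly rebuild catches whatever still slips through.

```python
# Minimal sketch of "retry on network error": attempt the Solr add a
# bounded number of times, then give up and let the nightly full
# rebuild restore consistency. solr_add is any callable that raises
# ConnectionError on a transient network failure.
def add_with_retry(solr_add, doc, attempts=3):
    last_err = None
    for _ in range(attempts):
        try:
            return solr_add(doc)
        except ConnectionError as err:   # transient network failure
            last_err = err
    raise last_err                       # exhausted; nightly rebuild will catch it
```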

Tim

-----Original Message-----
From: Shawn Heisey [mailto:solr@elyograg.org] 
Sent: Tuesday, March 15, 2011 9:13 AM
To: solr-user@lucene.apache.org
Subject: Re: keeping data consistent between Database and Solr

On 3/14/2011 9:38 PM, onlinespending@gmail.com wrote:
> But my main question is, how do I guarantee that data between my
Cassandra
> database and Solr index are consistent and up-to-date?

Our MySQL database has two unique indexes.  One is a document ID, 
implemented in MySQL as an autoincrement integer and in Solr as a long.

The other is what we call a tag id, implemented in MySQL as a varchar 
and Solr as a single lowercased token and serving as Solr's uniqueKey.  
We have an update trigger on the database that updates the document ID 
whenever the database document is updated.

We have a homegrown build system for Solr.  In a nutshell, it keeps 
track of the newest document ID in the Solr Index.  If the DIH 
delta-import fails, it doesn't update the stored ID, which means that on 
the next run, it will try and index those documents again.  Changes to 
the entries in the database are automatically picked up because the 
document ID is newer, but the tag id doesn't change, so the document in 
Solr is overwritten.

Things are actually more complex than I've written, because our index is 
distributed.  Hopefully it can give you some ideas for yours.

Shawn


Re: keeping data consistent between Database and Solr

Posted by "onlinespending@gmail.com" <on...@gmail.com>.
On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey <so...@elyograg.org> wrote:

> On 3/15/2011 12:54 PM, onlinespending@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr.  And you overwrite
>> this
>> document ID even when you modify an existing document.  Very cool.  I
>> suppose the number can even rotate back to 0, as long as you handle that.
>>
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0.  My build
> system would not be able to handle wraparound without manual intervention.
>  If we have that problem, I think we'd have to renumber the entire database
> and reindex.


One solution to reduce the rate at which this number grows would be to store
a "batch ID" rather than a "document ID". If you've just added batch #1428
to the Solr index, then any newly updated documents in your SQL database would
be assigned #1429. Since you already have a unique tag ID, you may be OK
with a non-unique ID for the sake of keeping track of index updates.
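The batch-ID bookkeeping could look something like this. It's a sketch under the assumptions above: documents updated since the last successful run are stamped with the next batch number, and the counter only advances when indexing that batch succeeds.

```python
# Sketch of the batch-ID idea: a single counter instead of a
# per-document autoincrement, so the number grows per index run
# rather than per update. All names are illustrative.
class BatchTracker:
    def __init__(self):
        self.indexed_batch = 0          # last batch known to be in Solr

    def stamp(self, doc):
        doc["batch"] = self.indexed_batch + 1   # assign the pending batch
        return doc

    def pending(self, docs):
        # Everything stamped with a batch newer than the last success.
        return [d for d in docs if d["batch"] > self.indexed_batch]

    def mark_indexed(self):
        self.indexed_batch += 1         # advance only after a successful commit
```

Because the batch ID is deliberately non-unique, Solr's uniqueKey (the tag ID) still decides which document gets overwritten.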


>
>
>  I am thinking of using a timestamp to achieve a similar thing. All
>> documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index.  In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>>
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like.  I hope you know what those words mean. :)
>  It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp.  That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.


Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to things.
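One way to hedge against clock skew with timestamps (an assumption on my part, not something established in this thread) is to re-read a small overlap window on every run; since indexing is idempotent, re-indexing a document twice is harmless because the uniqueKey overwrites it.

```python
# Sketch of a timestamp-based delta with an overlap window to tolerate
# clock skew. The 60-second window is an arbitrary illustrative value;
# updated_at would come from Cassandra's per-column timestamps.
OVERLAP_SECONDS = 60

def docs_to_index(all_docs, last_run_ts):
    cutoff = last_run_ts - OVERLAP_SECONDS
    return [d for d in all_docs if d["updated_at"] > cutoff]
```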


>
>
>  I'm curious how you handle the delta-imports. Do you have some routine
>> that
>> periodically checks for updates to your MySQL database via the document
>> ID?
>> Which language do you use for that?
>>
>
> The entire build system is written in Perl, where I am comfortable.  I even
> wrote an object-oriented module that the scripts share.  The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run.  There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards.  These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.


> Thanks,
> Shawn
>

Thanks for the info. That's very helpful!

Ben

Re: keeping data consistent between Database and Solr

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/15/2011 12:54 PM, onlinespending@gmail.com wrote:
> That's pretty interesting to use the autoincrementing document ID as a way
> to keep track of what has not been indexed in Solr.  And you overwrite this
> document ID even when you modify an existing document.  Very cool.  I
> suppose the number can even rotate back to 0, as long as you handle that.

We use a bigint for the value, and the highest value is currently less 
than 300 million, so we don't expect it to ever rotate around to 0.  My 
build system would not be able to handle wraparound without manual 
intervention.  If we have that problem, I think we'd have to renumber 
the entire database and reindex.

> I am thinking of using a timestamp to achieve a similar thing. All documents
> that have been accessed after the last Solr index need to be added to the
> Solr index.  In fact, each name-value pair in Cassandra has a timestamp
> associated with it, so I'm curious if I could simply use this.

As long as you can guarantee that it's all deterministic and idempotent, 
you can use anything you like.  I hope you know what those words mean. 
:)  It's important when using timestamps that the system that runs the 
build script is the same one that stores the last-used timestamp.  That 
way you are guaranteed that you will never have things getting missed 
because of clock skew.

> I'm curious how you handle the delta-imports. Do you have some routine that
> periodically checks for updates to your MySQL database via the document ID?
> Which language do you use for that?

The entire build system is written in Perl, where I am comfortable.  I 
even wrote an object-oriented module that the scripts share.  The update 
script runs every two minutes, from cron, indexing anything with a 
higher document ID than the one recorded during the last successful 
run.  There are some other scripts that run on longer intervals and 
handle things like deletes and data redistribution into shards.  These 
scripts kick off the build, then use the bare /dataimport URL to track 
when the import completes and whether it's successful.
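The "kick off the build, then poll /dataimport" step might be sketched like this. Here `fetch_status` stands in for an HTTP GET of the /dataimport handler; the `"status"`/`"idle"` fields match DIH's usual response, but treat the exact field names as assumptions to verify against your Solr version.

```python
# Hypothetical sketch of polling the DIH status handler until an
# import run finishes. fetch_status is any callable returning the
# parsed /dataimport response as a dict.
import time

def wait_for_import(fetch_status, poll_seconds=2, max_polls=1000):
    for _ in range(max_polls):
        status = fetch_status()              # parsed response of /dataimport
        if status.get("status") == "idle":
            return status                    # import done; inspect for success
        time.sleep(poll_seconds)             # still busy; poll again
    raise TimeoutError("import did not finish")
```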

Thanks,
Shawn


Re: keeping data consistent between Database and Solr

Posted by "onlinespending@gmail.com" <on...@gmail.com>.
That's pretty interesting to use the autoincrementing document ID as a way
to keep track of what has not been indexed in Solr.  And you overwrite this
document ID even when you modify an existing document.  Very cool.  I
suppose the number can even rotate back to 0, as long as you handle that.

I am thinking of using a timestamp to achieve a similar thing. All documents
that have been accessed after the last Solr index need to be added to the
Solr index.  In fact, each name-value pair in Cassandra has a timestamp
associated with it, so I'm curious if I could simply use this.

I'm curious how you handle the delta-imports. Do you have some routine that
periodically checks for updates to your MySQL database via the document ID?
Which language do you use for that?

Thanks,
Ben

On Tue, Mar 15, 2011 at 9:12 AM, Shawn Heisey <so...@elyograg.org> wrote:

> On 3/14/2011 9:38 PM, onlinespending@gmail.com wrote:
>
>> But my main question is, how do I guarantee that data between my Cassandra
>> database and Solr index are consistent and up-to-date?
>>
>
> Our MySQL database has two unique indexes.  One is a document ID,
> implemented in MySQL as an autoincrement integer and in Solr as a long.  The
> other is what we call a tag id, implemented in MySQL as a varchar and Solr
> as a single lowercased token and serving as Solr's uniqueKey.  We have an
> update trigger on the database that updates the document ID whenever the
> database document is updated.
>
> We have a homegrown build system for Solr.  In a nutshell, it keeps track
> of the newest document ID in the Solr Index.  If the DIH delta-import fails,
> it doesn't update the stored ID, which means that on the next run, it will
> try and index those documents again.  Changes to the entries in the database
> are automatically picked up because the document ID is newer, but the tag id
> doesn't change, so the document in Solr is overwritten.
>
> Things are actually more complex than I've written, because our index is
> distributed.  Hopefully it can give you some ideas for yours.
>
> Shawn
>
>

Re: keeping data consistent between Database and Solr

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/14/2011 9:38 PM, onlinespending@gmail.com wrote:
> But my main question is, how do I guarantee that data between my Cassandra
> database and Solr index are consistent and up-to-date?

Our MySQL database has two unique indexes.  One is a document ID, 
implemented in MySQL as an autoincrement integer and in Solr as a long.  
The other is what we call a tag id, implemented in MySQL as a varchar 
and Solr as a single lowercased token and serving as Solr's uniqueKey.  
We have an update trigger on the database that updates the document ID 
whenever the database document is updated.

We have a homegrown build system for Solr.  In a nutshell, it keeps 
track of the newest document ID in the Solr Index.  If the DIH 
delta-import fails, it doesn't update the stored ID, which means that on 
the next run, it will try and index those documents again.  Changes to 
the entries in the database are automatically picked up because the 
document ID is newer, but the tag id doesn't change, so the document in 
Solr is overwritten.
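The failure-safe bookkeeping described above can be sketched as follows: the stored "newest indexed ID" only advances when the delta-import succeeds, so a failed run automatically retries the same rows next time. `fetch_rows` and `index_rows` are illustrative stand-ins for the database query and the Solr import.

```python
# Sketch of delta indexing keyed on the highest document ID: advance
# the stored max_id only after a successful index run, so failures
# leave the state untouched and the next run retries the same rows.
def run_delta(fetch_rows, index_rows, state):
    rows = fetch_rows(state["max_id"])            # rows with id > max_id
    if not rows:
        return 0
    index_rows(rows)                              # raises on failure: state untouched
    state["max_id"] = max(r["id"] for r in rows)  # advance only on success
    return len(rows)
```

Re-running after a failure re-indexes the same documents, which is safe because the tag ID (Solr's uniqueKey) makes every add an overwrite.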

Things are actually more complex than I've written, because our index is 
distributed.  Hopefully it can give you some ideas for yours.

Shawn


Re: keeping data consistent between Database and Solr

Posted by "onlinespending@gmail.com" <on...@gmail.com>.
Solandra is great for adding better scalability and NRT to Solr, but
it pretty much just stores the index in Cassandra and insulates that
from the user. It doesn't solve the problem of allowing quick, direct
retrieval of data that need not be searched. I could certainly use a
Solr query to "directly" access a single document, but that has
overhead and would not be as efficient as hitting a database directly.
With potentially tens of thousands of simultaneous direct data
accesses, I'd rather not put this burden on Solr; I'd prefer to use it
only for search, as it was intended, while simple data retrieval comes
from a better-equipped database.

But my question of consistency applies to any database used alongside
Solr. I would imagine most people maintain separate MySQL and Solr stores.

On Tuesday, March 15, 2011, Bill Bell <bi...@gmail.com> wrote:
> Look at Solandra. Solr + Cassandra.
>
> On 3/14/11 9:38 PM, "onlinespending@gmail.com" <on...@gmail.com>
> wrote:
>
>>Like many people, Solr is not my primary data store. Not all of my data
>>need
>>be searchable and for simple and fast retrieval I store it in a database
>>(Cassandra in my case).  Actually I don't have this all built up yet, but
>>my
>>intention is that whenever new data is entered that it be added to my
>>Cassandra database and simultaneously added to the Solr index (either by
>>queuing up recent data before a commit or some other means; any
>>suggestions
>>on this front?).
>>
>>But my main question is, how do I guarantee that data between my Cassandra
>>database and Solr index are consistent and up-to-date?  What if I write
>>the
>>data to Cassandra and then a failure occurs during the commit to the Solr
>>index?  I would need to be aware what data failed to commit and make sure
>>that a re-attempt is made.  Obviously inconsistency for a short duration
>>is
>>inevitable when using two different databases (Cassandra and Solr), but I
>>certainly don't want a failure to create perpetual inconsistency.  I'm
>>curious what sort of mechanisms people are using to ensure consistency
>>between their database (MySQL, Cassandra, etc.) and Solr.
>>
>>Thank you,
>>Ben
>
>
>

Re: keeping data consistent between Database and Solr

Posted by Bill Bell <bi...@gmail.com>.
Look at Solandra. Solr + Cassandra.

On 3/14/11 9:38 PM, "onlinespending@gmail.com" <on...@gmail.com>
wrote:

>Like many people, Solr is not my primary data store. Not all of my data
>need
>be searchable and for simple and fast retrieval I store it in a database
>(Cassandra in my case).  Actually I don't have this all built up yet, but
>my
>intention is that whenever new data is entered that it be added to my
>Cassandra database and simultaneously added to the Solr index (either by
>queuing up recent data before a commit or some other means; any
>suggestions
>on this front?).
>
>But my main question is, how do I guarantee that data between my Cassandra
>database and Solr index are consistent and up-to-date?  What if I write
>the
>data to Cassandra and then a failure occurs during the commit to the Solr
>index?  I would need to be aware what data failed to commit and make sure
>that a re-attempt is made.  Obviously inconsistency for a short duration
>is
>inevitable when using two different databases (Cassandra and Solr), but I
>certainly don't want a failure to create perpetual inconsistency.  I'm
>curious what sort of mechanisms people are using to ensure consistency
>between their database (MySQL, Cassandra, etc.) and Solr.
>
>Thank you,
>Ben