
Posted to dev@lucene.apache.org by William Nelis <Wi...@morningstar.com> on 2017/05/19 19:51:45 UTC

Incremental Indexing when Source Data is not Incremental

Hello.

I am new to Solr and have a question about incremental indexing. We have a source text file that contains millions of rows. Each row is saved as a document in Solr. There is one field in each row that is a unique identifier.

Unfortunately, this source text file can change. We need to check it every hour for changes. If rows are removed, we must remove them from Solr. If rows are added, we must add them to Solr.

We do not want to drop all records and re-load them. Instead we would like to diff for the changes. What is the recommended way of doing this? Can we just get all values Solr stores for the unique identifier field and do the diff external to Solr? Does Solr provide functionality that will allow us to do the incremental changes even though the source file itself is not incremental?


An example of the file format (obviously this is not a real file):

AAQX     This is the first document      213.32
AAZT     This is the second document     243.23
ABGT     This is the third document      321.43
...

The first column is the unique identifier (there are far more columns, but this has been simplified).


Thank you for your help.

Re: Incremental Indexing when Source Data is not Incremental

Posted by David Smiley <da...@gmail.com>.

Please ask for help on the solr-user list.  This is the dev list for Solr
internals.
Thanks

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Incremental Indexing when Source Data is not Incremental

Posted by Erick Erickson <er...@gmail.com>.

Yes, you can get all the current IDs from Solr, but it's a bit
cumbersome. Use the /export handler (you have to ensure that all
fields you return have docValues="true"), then write some sort of
script that diffs them against your source file.

There's nothing in Solr that will do the diff for you; you'll have to
"roll your own" here.
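
A minimal sketch of such a script, in Python. The base URL, the
collection name and the field name "id" are placeholders, not anything
from this thread. It assumes the /export response is JSON with the
documents under response.docs, and it loads every ID into memory, which
may or may not be acceptable for millions of rows.

```python
import json
import urllib.parse
import urllib.request


def export_url(base_url, collection, id_field="id"):
    """Build an /export request that returns only the unique-key field.

    /export needs both fl and sort; id_field must have docValues="true".
    base_url and collection are placeholders for your own setup.
    """
    params = urllib.parse.urlencode(
        {"q": "*:*", "fl": id_field, "sort": f"{id_field} asc", "wt": "json"}
    )
    return f"{base_url}/{collection}/export?{params}"


def fetch_solr_ids(url, id_field="id"):
    """Download the export and collect the IDs into a set.

    Assumes the documents are listed under response.docs in the JSON.
    """
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    return {doc[id_field] for doc in payload["response"]["docs"]}


def read_source_ids(lines):
    """Take the first whitespace-delimited column of each row as the ID."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}


def diff_ids(source_ids, solr_ids):
    """Return (ids_to_add, ids_to_delete) as two sets."""
    return source_ids - solr_ids, solr_ids - source_ids
```

Usage would be along the lines of
diff_ids(read_source_ids(open(path)), fetch_solr_ids(export_url(...))),
followed by adds and deletes sent to Solr for the two resulting sets.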

What people often do is keep a list of changes and operate on _that_.
In the DB world that would be a trigger on your table that records each
operation along with the affected ID, so you'd have something like:

op      ID
delete  123
add     456

Then you process those changes in order to get your deltas. Perhaps you
could do something similar in whatever changes the file in the first
place, recording the changes either in a DB or in a text file somewhere.
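
As a rough illustration of replaying such a change list, here is a
Python sketch. It assumes a plain-text log with one "op ID" pair per
line, as in the example above. The function names are made up for
illustration, and actually sending the adds and deletes to Solr is left
out.

```python
def parse_change_log(lines):
    """Yield (op, doc_id) pairs, skipping the header row and blank lines."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or parts[0] == "op":
            continue
        yield parts[0], parts[1]


def net_changes(changes):
    """Replay the changes in order; the last operation on each ID wins.

    Returns (ids_to_add, ids_to_delete) as sorted lists.
    """
    final = {}
    for op, doc_id in changes:
        if op not in ("add", "delete"):
            raise ValueError(f"unknown operation: {op!r}")
        final[doc_id] = op
    to_add = sorted(i for i, op in final.items() if op == "add")
    to_delete = sorted(i for i, op in final.items() if op == "delete")
    return to_add, to_delete
```

Collapsing to the last operation per ID means a document that was
deleted and then added again within the same hour is only indexed once.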

How do you detect whether a particular doc has _changed_? You'll have
to re-index in that case too.
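
One possible way to catch changed rows, sketched in Python: keep a hash
of every row from the previous run and compare it with the current
file. This is an assumption about how it could be done, not a Solr
feature. The helper names are hypothetical, and the previous hashes
would need to be persisted somewhere between runs.

```python
import hashlib


def row_hashes(lines):
    """Map each row's ID (first column) to a hash of the whole row."""
    hashes = {}
    for line in lines:
        if not line.strip():
            continue
        doc_id = line.split(None, 1)[0]
        row = line.rstrip("\n").encode("utf-8")
        hashes[doc_id] = hashlib.sha256(row).hexdigest()
    return hashes


def classify(previous, current):
    """Compare two {id: hash} maps and return (added, removed, changed)."""
    added = current.keys() - previous.keys()
    removed = previous.keys() - current.keys()
    changed = {
        doc_id
        for doc_id in current.keys() & previous.keys()
        if current[doc_id] != previous[doc_id]
    }
    return added, removed, changed
```

Rows in "added" and "changed" would be (re-)indexed, and IDs in
"removed" deleted.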

Best,
Erick


