Posted to solr-user@lucene.apache.org by lalitjangra <la...@gmail.com> on 2014/02/13 06:15:59 UTC

Solr delta indexing approach

Hi,

I am working on a prototype where I have a content source, and I am indexing
all of its documents and storing the index in Solr. My content source is
ever-changing: new content is always being added to it. As I have read, Solr
indexes the full source every time it is asked to index. This may waste
resources, since the same documents are reindexed again and again. Is there
any approach to handle such scenarios? E.g. I have 10000 documents in my
source which have been indexed by Solr as of today, but the next day my
source has 11000 documents. I want to index only the 1000 new documents, not
all 11000. Can anybody suggest an approach?

Thanks in advance.



--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-delta-indexing-approach-tp4117068.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr delta indexing approach

Posted by lalitjangra <la...@gmail.com>.
Thanks all.

I am following a couple of articles on this.

I am sending data to Solr instead of using DIH, and I am able to index
data in Solr successfully.

My concern is how to minimize Solr indexing so that, out of all data items,
only the updated data is indexed each time.

Is this available out of the box (OOTB) in Solr, or do we need to build it
ourselves?

Regards.




Re: Solr delta indexing approach

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
I'd start by doing the Solr tutorial. It will explain a lot of things.
But in summary, you can send data to Solr (best option) or you can
pull it using DataImportHandler. Take your pick, do the tutorial,
maybe read some books. Then come back with specific questions about
the approach you started with.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Thu, Feb 13, 2014 at 12:45 PM, lalitjangra <la...@gmail.com> wrote:
> Thanks Alex,
>
> Yes, my source system maintains the creation and last-modification
> timestamps of each document.
>
> As per your inputs, can I assume that the next time Solr starts indexing,
> it scans all the documents present in the source but picks only those which
> are either new or have been updated since the last successful indexing?
>
> How does Solr do this, or in short, what is Solr's strategy for indexing? I
> would definitely like to know more about it, and if you can share your
> thoughts on this, it would be great.
>
> Regards,
> Lalit.

RE: Solr delta indexing approach

Posted by "Sadler, Anthony" <An...@yrgrp.com>.
At the risk of derailing the thread:

We do a lot more in the script than is mentioned here: we pull out parts of the path and mangle them (for example, turn them into a UNC path for users to use, or pull out a client name or job number using a known folder structure). As for deleted files, here's how the script works in its entirety:

- Script runs for the first time, finds every file and puts it into the formerly empty Solr DB. For every file found, it sets date_last_seen = current_time. It writes out the last_begin_time file and ends.
- Secondary function of the script runs sometime after and looks for any file with date_last_seen < last_begin_time. Nothing is found this time around.
- Script runs the next time, sees that there is a last_begin_time file, and reads in that time. The script then runs in a "delta" mode, looking for all files modified later than last_begin_time. If it finds them, it re-indexes them and their contents. All other files, which have a mod_time less than last_begin_time, merely have their date_last_seen updated. The script ends and writes out the last_begin_time file.

At this point, any files that were deleted between the first and second runs were not updated, so their date_last_seen is different from all the others. This gives me something to look for.

- Secondary function of the script runs sometime after and looks for any file with date_last_seen < last_begin_time. This time around, some files are found. These files have their isDeleted field in Solr set to 1.

Hopefully that makes a bit more sense.
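
The two passes described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual Perl script: a plain dictionary stands in for the Solr index, and the names IndexedFile, scan_pass and mark_deleted are hypothetical. Only date_last_seen, last_begin_time and isDeleted come from the description.

```python
from dataclasses import dataclass


@dataclass
class IndexedFile:
    """Hypothetical stand-in for one Solr document describing a file."""
    path: str
    date_last_seen: float
    is_deleted: bool = False  # plays the role of the isDeleted field


def scan_pass(index, files_on_disk, last_begin_time, now):
    """Main pass: (re)index new or modified files, touch all the others.

    index          -- dict path -> IndexedFile (assumed stand-in for Solr)
    files_on_disk  -- dict path -> modification time
    last_begin_time -- start time of the previous scan, or None on first run
    now            -- start time of this scan
    Returns the paths whose contents were (re)indexed.
    """
    reindexed = []
    for path, mod_time in files_on_disk.items():
        is_new = path not in index
        if is_new or last_begin_time is None or mod_time > last_begin_time:
            # Full reindex: replaces any earlier entry for this path.
            index[path] = IndexedFile(path, date_last_seen=now)
            reindexed.append(path)
        else:
            # Unchanged: only record that the file is still there.
            index[path].date_last_seen = now
            index[path].is_deleted = False
    return reindexed


def mark_deleted(index, last_begin_time):
    """Secondary pass: flag entries not seen since the last scan began."""
    flagged = []
    for doc in index.values():
        if doc.date_last_seen < last_begin_time and not doc.is_deleted:
            doc.is_deleted = True
            flagged.append(doc.path)
    return flagged
```

Running scan_pass, then mark_deleted with the start time of that same scan, reproduces the sequence above: nothing is flagged after the first scan, and files that disappeared before the second scan are flagged after it.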


-----Original Message-----
From: Walter Underwood [mailto:wunder@wunderwood.org] 
Sent: Thursday, 13 February 2014 5:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr delta indexing approach

Why write a Perl script for that?

touch new_timestamp
find . -newer timestamp | script-to-submit && mv new_timestamp timestamp

Neither approach deals with deleted files.

To do this correctly, you need lists of all the files in the index with their timestamps, and of all the files in the repository. Then you need to difference them to find deleted ones, new ones, and ones that have changed. You might even want to track links and symlinks to get dupes and canonical paths.

wunder


Re: Solr delta indexing approach

Posted by Walter Underwood <wu...@wunderwood.org>.
Why write a Perl script for that?

touch new_timestamp
find . -newer timestamp | script-to-submit && mv new_timestamp timestamp

Neither approach deals with deleted files.

To do this correctly, you need lists of all the files in the index with their timestamps, and of all the files in the repository. Then you need to difference them to find deleted ones, new ones, and ones that have changed. You might even want to track links and symlinks to get dupes and canonical paths.
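
As a rough sketch of that differencing step, assuming both listings are already available as path-to-timestamp mappings (how they are obtained from the index and the repository is left out, and the function name diff_listings is hypothetical):

```python
def diff_listings(indexed, repository):
    """Compare two path -> timestamp mappings.

    indexed    -- what the index holds (timestamp recorded at indexing time)
    repository -- what is in the repository now (current modification time)
    Returns (new, changed, deleted) as sorted lists of paths.
    """
    indexed_paths = set(indexed)
    repo_paths = set(repository)

    new = sorted(repo_paths - indexed_paths)        # in repository only
    deleted = sorted(indexed_paths - repo_paths)    # in index only
    # Any timestamp mismatch counts as a change, including a file that was
    # restored with an older modification time.
    changed = sorted(
        path for path in indexed_paths & repo_paths
        if repository[path] != indexed[path]
    )
    return new, changed, deleted
```

Links, symlinks and canonical paths are not handled here; paths would need to be normalised before being used as keys.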

wunder

On Feb 12, 2014, at 10:00 PM, "Sadler, Anthony" <An...@yrgrp.com> wrote:

> I had this problem when I started to look at Solr as an index for a file server. What I ended up doing was writing a Perl script that did this:
> 
> - Scan the whole filesystem and create XML that is submitted to Solr for indexing. As this might be some 600,000 files, I break it down into chunks of N files (N = 200 currently).
> - At the end of a successful scan, write out the time it started to a file.
> - Next time you run the script, it looks for the start time file. It reads that in and checks every file in the system:
> = If it has a mod_time greater than the begin_time, it has changed since we last updated it, so reindex it.
> = If it doesn't, just update the last_seen timestamp in Solr (a field we created) so we know it's still there.
> 
> We're doing that and it's indexing just fine.
> 

--
Walter Underwood
wunder@wunderwood.org




RE: Solr delta indexing approach

Posted by "Sadler, Anthony" <An...@yrgrp.com>.
I had this problem when I started to look at Solr as an index for a file server. What I ended up doing was writing a Perl script that did this:

- Scan the whole filesystem and create XML that is submitted to Solr for indexing. As this might be some 600,000 files, I break it down into chunks of N files (N = 200 currently).
- At the end of a successful scan, write out the time it started to a file.
- Next time you run the script, it looks for the start time file. It reads that in and checks every file in the system:
= If it has a mod_time greater than the begin_time, it has changed since we last updated it, so reindex it.
= If it doesn't, just update the last_seen timestamp in Solr (a field we created) so we know it's still there.

We're doing that and it's indexing just fine.
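
The steps above could look roughly like this in Python. It is a sketch under assumptions, not the Perl script itself: the submit callable is a hypothetical placeholder for whatever builds the XML and posts it to Solr, the function names are invented for illustration, and the start-time file is assumed to live outside the scanned tree.

```python
import os
import time
from pathlib import Path

CHUNK_SIZE = 200  # files per submission, matching N = 200 above


def read_begin_time(stamp_file):
    """Return the start time of the last successful scan, or None on first run."""
    try:
        return float(Path(stamp_file).read_text().strip())
    except FileNotFoundError:
        return None


def plan_scan(root, begin_time):
    """Walk root and split files into those to reindex and those to touch."""
    to_reindex, to_touch = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if begin_time is None or os.path.getmtime(path) > begin_time:
                to_reindex.append(path)   # new or modified since last scan
            else:
                to_touch.append(path)     # unchanged: refresh last_seen only
    return to_reindex, to_touch


def chunks(items, size=CHUNK_SIZE):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_scan(root, stamp_file, submit):
    """One scan: plan, submit in chunks, then record when the scan started.

    submit(kind, batch) is a hypothetical callable; kind is "reindex" or
    "touch" and batch is a list of file paths.
    """
    started = time.time()
    begin_time = read_begin_time(stamp_file)
    to_reindex, to_touch = plan_scan(root, begin_time)
    for batch in chunks(to_reindex):
        submit("reindex", batch)
    for batch in chunks(to_touch):
        submit("touch", batch)
    # Written only after every batch went through, so a failed scan leaves
    # the old start time in place and the work is picked up again next run.
    Path(stamp_file).write_text(repr(started))
    return len(to_reindex), len(to_touch)
```

Recording the time the scan started, rather than the time it finished, means a file modified while the scan is running is picked up again on the next run instead of being missed.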

-----Original Message-----
From: lalitjangra [mailto:lalit.j.jangra@gmail.com]
Sent: Thursday, 13 February 2014 4:45 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr delta indexing approach

Thanks Alex,

Yes, my source system maintains the creation and last-modification timestamps of each document.

As per your inputs, can I assume that the next time Solr starts indexing, it scans all the documents present in the source but picks only those which are either new or have been updated since the last successful indexing?

How does Solr do this, or in short, what is Solr's strategy for indexing? I would definitely like to know more about it, and if you can share your thoughts on this, it would be great.

Regards,
Lalit.









Re: Solr delta indexing approach

Posted by lalitjangra <la...@gmail.com>.
Thanks Alex,

Yes, my source system maintains the creation and last-modification
timestamps of each document.

As per your inputs, can I assume that the next time Solr starts indexing,
it scans all the documents present in the source but picks only those which
are either new or have been updated since the last successful indexing?

How does Solr do this, or in short, what is Solr's strategy for indexing? I
would definitely like to know more about it, and if you can share your
thoughts on this, it would be great.

Regards,
Lalit.








Re: Solr delta indexing approach

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You have read that Solr needs to reindex a full source. That's correct
(unless you use atomic updates). But the important point is that this
is per document. So, once you have indexed your 10000 documents, you
don't need to worry about them until they change.

Just go ahead and index only your additional documents. I am assuming
your source system can figure out what the new ones are (timestamp,
etc.).
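
A minimal sketch of that idea in Python, assuming each source document carries a "modified" value as a UTC ISO 8601 string and that the client remembers when it last indexed. The field name, the function names and the URL in the usage note are assumptions for illustration. Solr's update handler accepts a JSON array of documents, and a document sent with an existing unique key replaces the earlier version.

```python
import json
import urllib.request


def select_changed(documents, last_indexed_at):
    """Keep only documents created or modified after the last indexing run.

    Timestamps are assumed to be UTC ISO 8601 strings in one fixed format
    (e.g. "2014-02-13T06:15:59Z"), which compare correctly as plain strings.
    """
    return [doc for doc in documents if doc["modified"] > last_indexed_at]


def build_update_request(solr_update_url, documents):
    """Build (but do not send) a POST carrying the documents as a JSON array."""
    body = json.dumps(documents).encode("utf-8")
    return urllib.request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With 11000 source documents of which 1000 are new, select_changed returns just those 1000. The request could then be sent with urllib.request.urlopen to an update URL such as http://localhost:8983/solr/collection1/update?commit=true, where the host and collection name are placeholders.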

Regards,
   Alex.


On Thu, Feb 13, 2014 at 12:15 PM, lalitjangra <la...@gmail.com> wrote:
> Hi,
>
> I am working on a prototype where I have a content source, and I am indexing
> all of its documents and storing the index in Solr. My content source is
> ever-changing: new content is always being added to it. As I have read, Solr
> indexes the full source every time it is asked to index. This may waste
> resources, since the same documents are reindexed again and again. Is there
> any approach to handle such scenarios? E.g. I have 10000 documents in my
> source which have been indexed by Solr as of today, but the next day my
> source has 11000 documents. I want to index only the 1000 new documents, not
> all 11000. Can anybody suggest an approach?
>
> Thanks in advance.