Posted to solr-user@lucene.apache.org by Shamik Bandopadhyay <sh...@gmail.com> on 2014/02/11 02:20:22 UTC

Indexing question on individual field update

Hi,

  I'm currently indexing a bunch of fields for a given document. For example,
let's assume there's a field called "rating". The rating field is not part
of the original document at index time, so the value is blank. The field
gets updated by an external service when the document is rated by users.
The service makes a partial Solr update and sets the appropriate rating
value. But when I re-index the same document, the rating field gets
overwritten and reset to blank. I understand that indexing in Solr is a
delete and add, but is there a way to do conditional indexing at the
field level, which will keep the value if it's already present in the index
for a given id?

Any pointers will be appreciated.

Thanks,
Shamik

Re: Indexing question on individual field update

Posted by shamik <sh...@gmail.com>.
Ok, I was wrong here. I can always set the indextimestamp field to the current
time (NOW) on every atomic update. On a similar note, is there any
performance penalty with updates compared to adds?



--
View this message in context: http://lucene.472066.n3.nabble.com/Indexing-question-on-individual-field-update-tp4116605p4116772.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Indexing question on individual field update

Posted by shamik <sh...@gmail.com>.
Thanks Erick and Shawn, appreciate your help.




Re: Indexing question on individual field update

Posted by Erick Erickson <er...@gmail.com>.
Update and add are basically the same thing if there's an existing document.
There will be some performance consequence since you're getting the stored
fields on the server as opposed to getting the full input from the external
source and handing it to Solr. However, I know of at least one situation
where the atomic update rate is sky-high and it works, so I wouldn't worry
about it unless and until I saw a problem.

Best,
Erick



Re: Indexing question on individual field update

Posted by Shawn Heisey <so...@elyograg.org>.
On 2/11/2014 2:37 PM, shamik wrote:
> Erick,
>
>    Thanks for your reply. I should have given better context. I'm currently
> running a daily incremental crawl on this particular source and indexing
> the documents. The incremental crawl looks for any change since the last
> crawl date, based on the document publish date. But there's no way for me
> to know if a document has been deleted. To handle that, I run a full crawl
> on a weekend, which basically re-indexes the entire content. After the full
> index is over, I call a purge script, which deletes any content that is
> more than 24 hours old, based on the indextimestamp field.
>
> The issue with atomic updates is that they don't alter the indextimestamp
> field. So even if I run a full crawl with atomic updates, the timestamp will
> stick to its old value. Unfortunately, I can't rely on another date field
> coming from the source, as those are not consistent. That means I can't
> remove stale content.

One possibility is this: When you send the atomic update to Solr, 
include a new value for the indextimestamp field.
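
As a rough illustration of that first option, here is a minimal Python sketch that builds such an atomic update body. The {"set": ...} modifier is Solr's atomic update syntax; the helper name build_atomic_update, the document id "doc-42" and the "title" field are made up for the example, and it assumes indextimestamp is a regular date field that accepts an explicit UTC timestamp.

```python
import json
from datetime import datetime, timezone


def build_atomic_update(doc_id, fields, now=None):
    """Build an atomic-update document that sets the given fields and
    refreshes indextimestamp; fields not named here are left untouched."""
    now = now or datetime.now(timezone.utc)
    doc = {"id": doc_id}
    for name, value in fields.items():
        doc[name] = {"set": value}
    # Solr date fields expect ISO-8601 in UTC with a trailing Z.
    doc["indextimestamp"] = {"set": now.strftime("%Y-%m-%dT%H:%M:%SZ")}
    return doc


if __name__ == "__main__":
    # Hypothetical id and field, for illustration only.
    update = build_atomic_update("doc-42", {"title": "Updated title"})
    # The JSON update handler accepts a list of documents.
    print(json.dumps([update], indent=2))
```

The resulting JSON would be posted to the core's update handler; how the request is sent (client library, commit policy) is left out on purpose.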

Another option: You can write a custom update processor plugin for 
Solr.  When the custom code is used, it will be executed on each 
incoming document.  Depending on what it finds in the update request, it 
can make appropriate changes, like updating indextimestamp.  You can do 
pretty much anything.

http://wiki.apache.org/solr/UpdateRequestProcessor

Writing an update processor in Java typically gives the best results in 
terms of flexibility and performance, but there is also a way to use 
other programming languages:

http://wiki.apache.org/solr/ScriptUpdateProcessor

Thanks,
Shawn


Re: Indexing question on individual field update

Posted by shamik <sh...@gmail.com>.
Erick,

  Thanks for your reply. I should have given better context. I'm currently
running a daily incremental crawl on this particular source and indexing
the documents. The incremental crawl looks for any change since the last
crawl date, based on the document publish date. But there's no way for me
to know if a document has been deleted. To handle that, I run a full crawl
on a weekend, which basically re-indexes the entire content. After the full
index is over, I call a purge script, which deletes any content that is
more than 24 hours old, based on the indextimestamp field.
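
For readers unfamiliar with this kind of purge, a delete-by-query along these lines is one way it might be expressed. This is only a sketch, not the actual purge script: the helper name build_purge_request is hypothetical, and the range query relies on Solr date math (NOW-24HOURS) against the indextimestamp field.

```python
import json


def build_purge_request(field="indextimestamp", max_age_hours=24):
    """Build a delete-by-query body matching documents whose timestamp
    is older than max_age_hours, using Solr date math relative to NOW."""
    query = f"{field}:[* TO NOW-{max_age_hours}HOURS]"
    return {"delete": {"query": query}}


if __name__ == "__main__":
    print(json.dumps(build_purge_request()))
```

The body would be sent to the update handler and followed by a commit; anything whose indextimestamp was not refreshed during the full crawl then falls inside the range and is removed.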

The issue with atomic updates is that they don't alter the indextimestamp
field. So even if I run a full crawl with atomic updates, the timestamp will
stick to its old value. Unfortunately, I can't rely on another date field
coming from the source, as those are not consistent. That means I can't
remove stale content.

Let me know if I'm missing something here.

- Thanks,
Shamik






Re: Indexing question on individual field update

Posted by Erick Erickson <er...@gmail.com>.
I'm assuming you're using the atomic update feature to
update the individual field. Why not use it when you replace
the rest of the doc?
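
To make the difference concrete, here is a toy in-memory model in Python. It is not Solr code, just a sketch of the two behaviours: a plain add replaces the whole document, while an atomic "set" starts from the stored document and overwrites only the named fields (which is why atomic updates generally need the fields to be stored). The function names and the "doc-42" document are invented for the example.

```python
def full_add(index, doc):
    """Plain add: the new document replaces the old one wholesale."""
    index[doc["id"]] = dict(doc)


def atomic_set(index, doc_id, fields):
    """Atomic 'set': start from the stored document and overwrite only
    the named fields, roughly what happens on the server."""
    merged = dict(index.get(doc_id, {"id": doc_id}))
    merged.update(fields)
    index[doc_id] = merged


index = {}
full_add(index, {"id": "doc-42", "title": "Old title"})
atomic_set(index, "doc-42", {"rating": 4})           # external rating service
atomic_set(index, "doc-42", {"title": "New title"})  # re-index as atomic update
rating_after_atomic = index["doc-42"].get("rating")  # rating is still there

full_add(index, {"id": "doc-42", "title": "New title"})  # re-index as plain add
rating_after_add = index["doc-42"].get("rating")         # rating is gone
print(rating_after_atomic, rating_after_add)
```

In the model, the rating survives the atomic re-index and disappears after the plain add, which mirrors the behaviour described in the original question.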

Best,
Erick

