Posted to solr-user@lucene.apache.org by Don Werve <do...@madwombat.com> on 2009/08/28 17:49:01 UTC

Partial updates?

Short version:

Is there a way to either do partial updates to documents (update/add one or
two fields only), or to search across multiple documents grouped by a
(non-unique) key stored in a field?

Long version:

I've run into an issue with the way I'm indexing documents for a new
product, and figure that somebody else has run into the same problem.  In a
nutshell, we're building a system that deals with a lot of incoming and
outgoing text documents (email, Word docs, short comments, etc.), grouped
together by some common factor (basically, email threads), and want to do
full-text search across those threads.

We've settled on Solr, of course. :)

Right now, I'm adding each new incoming/outgoing message as a new document,
and can search just fine, unless I want to look for multiple terms that span
documents.  So, "foo" is in the first document, "bar" is in the second, and
although they both have a 'thread_id' field identifying them as belonging to
the same group, searching for "+foo +bar" doesn't yield results (which is
not surprising).
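
To make the mismatch concrete, here is a toy in-memory model in Python. It
is only a sketch and not Solr's query engine: the helper names and the
sample messages are made up for illustration, and "matching" is plain
whitespace tokenization.

```python
from collections import defaultdict


def matches_all(text, terms):
    """True if every required term occurs as a word in the text."""
    words = set(text.lower().split())
    return all(term in words for term in terms)


def search_per_message(messages, terms):
    """One document per message: each message must contain all terms itself."""
    return [m["id"] for m in messages if matches_all(m["text"], terms)]


def search_per_thread(messages, terms):
    """One document per thread: the terms may be spread over several messages."""
    combined = defaultdict(list)
    for m in messages:
        combined[m["thread_id"]].append(m["text"])
    return [
        thread_id
        for thread_id, texts in combined.items()
        if matches_all(" ".join(texts), terms)
    ]


# Hypothetical sample data: "foo" and "bar" live in different messages of t1.
messages = [
    {"id": "m1", "thread_id": "t1", "text": "foo appears here"},
    {"id": "m2", "thread_id": "t1", "text": "bar appears here"},
    {"id": "m3", "thread_id": "t2", "text": "only foo in this thread"},
]

print(search_per_message(messages, ["foo", "bar"]))  # no single message has both
print(search_per_thread(messages, ["foo", "bar"]))   # thread t1 has both
```

With one document per message, the required-terms query finds nothing; with
one document per thread, thread t1 matches.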

Now, I can modify the code to store one document for each group of messages
without too much work.  But as I understand it, this means that for every
new message coming in, I need to hand an aggregate of all previous messages
to the indexer, because Solr will re-create the document (which indexes the
entire group of messages) when I do update/add.  Since there can be some
fairly large files sitting in there (50-100M in some cases), I'd rather not
have to shove that down Solr's pipe every time something changes.
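
A rough Python sketch of what that would look like on the client side,
assuming update/add really does replace the whole document. The ThreadStore
class, the "body" field name, and the size helper are hypothetical, and
nothing here talks to Solr; it only shows how the payload grows.

```python
class ThreadStore:
    """Keeps every message of a thread so the full document can be rebuilt."""

    def __init__(self):
        self._messages = {}  # thread_id -> list of message bodies

    def add_message(self, thread_id, text):
        """Record a new message and return the full replacement document."""
        bodies = self._messages.setdefault(thread_id, [])
        bodies.append(text)
        # The whole thread is handed over again, not just the new message.
        return {"id": thread_id, "body": list(bodies)}


def payload_chars(doc):
    """Number of characters of message text in one replacement document."""
    return sum(len(body) for body in doc["body"])


store = ThreadStore()
first = store.add_message("t1", "a" * 10)
second = store.add_message("t1", "b" * 5)

print(payload_chars(first))   # only the first message
print(payload_chars(second))  # first message again, plus the new one
```

Every new message resends everything that came before it, which is the cost
I would like to avoid for threads that contain large files.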

So, first question: is what I think I know about update/add correct?

Second, if so, is there a way that I can update single-valued fields and
append new multivalued fields, without having to re-index the whole
document?

Third, am I just totally wrong about the way I'm trying to do this, and is
there a better way?

Thanks-in-advance!

RE: Partial updates?

Posted by Brandon Ramirez <Br...@elementk.com>.

I would love to see this too.  Most of our data comes from a relational database, but there are some files on the file system related to our products that may need to be indexed.  The files have a different change-control life cycle, so I can't be sure that our application will know when this data changes; a recurring background re-index job would be helpful.  Having to go to the database to get 99% of the data (which didn't change anyway) to send along with the 1% from the file system is a big limitation.

This also prevents the use of DIH.


Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 
Software Engineer II | Element K | www.elementk.com


-----Original Message-----
From: mlevy [mailto:mlevy@ushmm.org] 
Sent: Friday, October 28, 2011 2:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial updates?

An ability to update would be extremely useful for us. Different parts of records sometimes come from different databases, and being able to update after creation of the Solr index would be extremely useful.

I've made some processes that read a record and add a new field to it. The most awkward thing is when there's been a CopyField: when the record is read and re-saved, the copied field causes CopyField to be invoked again.

--
View this message in context: http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Partial updates?

Posted by mlevy <ml...@ushmm.org>.

An ability to update would be extremely useful for us. Different parts of
records sometimes come from different databases, and being able to update
after creation of the Solr index would be extremely useful.

I've made some processes that read a record and add a new field to it. The
most awkward thing is when there's been a CopyField: when the record is read
and re-saved, the copied field causes CopyField to be invoked again.
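
One possible workaround, sketched in Python: drop the copy destination
fields from the record before re-saving it, so that only the source fields
go back in and the copy is made once. This assumes the client knows which
fields are copy destinations; the function and the field names ("title",
"text_all", "source_db") are hypothetical.

```python
def prepare_for_resave(stored_doc, copy_dest_fields, new_fields):
    """Return a copy of the record that is safe to re-add.

    stored_doc       -- the record as read back from the index
    copy_dest_fields -- names of fields that are filled by CopyField
    new_fields       -- the fields to add or overwrite
    """
    # Leave out copy destinations so their old values are not sent back in.
    doc = {k: v for k, v in stored_doc.items() if k not in copy_dest_fields}
    doc.update(new_fields)
    return doc


# Hypothetical record in which "text_all" was filled from "title" by a copy.
stored = {"id": "rec-1", "title": "Annual report", "text_all": ["Annual report"]}
resave = prepare_for_resave(stored, {"text_all"}, {"source_db": "archive"})

print(resave)
```

The original record is left untouched, and the re-saved one carries only the
source fields plus the new field.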


Re: Partial updates?

Posted by Don Werve <do...@madwombat.com>.

Fantastic!  Anything I can do to help out?

Re: Partial updates?

Posted by Paul Rosen <pa...@performantsoftware.com>.

That sounds very similar to my use case, too. (Mentioned in the recent 
thread "Updating a solr record"). So +1 on allowing updates!

Jason Rutherglen wrote:
> Don,
> 
> I started work on fixing this a while back, and I plan to
> resume soon. Basically one would be able to update fields
> to a parallel index, without reindexing the entire document.
> There are other use cases I've seen for this such as caching.
> 
> -J
> 

Re: Partial updates?

Posted by Jason Rutherglen <ja...@gmail.com>.

Don,

I started work on fixing this a while back, and I plan to
resume soon. Basically one would be able to update fields
to a parallel index, without reindexing the entire document.
There are other use cases I've seen for this such as caching.

-J
