Posted to solr-user@lucene.apache.org by Paul Rosen <pa...@performantsoftware.com> on 2009/08/27 19:21:36 UTC

Updating a solr record

I realize there is no way to update particular fields in a Solr record. 
I know the recommendation is to delete the record from the index and 
re-add it, but in my case it is difficult to completely reindex, which 
creates problems for my workflow.

That is, the info that I use to create a solr doc comes from two places: 
a local file that contains most of the info, and a URL in that file that 
points to a web page that contains the rest of the info.

To completely reindex, we have to hit every website again, which is 
problematic for a number of reasons. (Plus, those websites don't change 
much, so it is just wasted effort.) (Once in a while we do reindex, and 
it is a huge production to do so.)

But that means that if I want to make a small change to either 
schema.xml or the local files that I'm indexing, I can't. I can't even 
fix minor bugs until our yearly reindexing.

So, the question is:

Is there any way to get the info that is already in the solr index for a 
document, so that I can use that as a starting place? I would just tweak 
that record and add it again.

Thanks,
Paul

Re: Updating a solr record

Posted by Paul Rosen <pa...@performantsoftware.com>.
Eric Pugh wrote:
> Do you have to "reindex"?  Do you mean an optimize operation?  You
> can do an "update" by just sending Solr a new record, and letting Solr
> deal with the removing and adding of the data.

The problem is that I can't easily create the new record. There is some 
data that I no longer have access to, but did at the time I created the 
record to begin with.

> You can just query Solr, find the records that you want (including all
> the website data).  Update them, and then send the entire record back.

This is what I'd like to know how to do. I'll experiment with this, but 
I thought that I wouldn't be able to get back all the info I need to 
recreate the doc.

> 
> Or am I missing something?  Are these documents so huge that you don't
> want to pull back an entire record for some reason?

I would like to get the record from solr because I just can't create the 
record the same way as I originally did.

(Besides the time involved in crawling all those websites, some of them 
only allow us access for a limited amount of time, so to reindex, we 
need to call them up and schedule a time for them to whitelist us.)

> 
> Eric
> 


Re: Updating a solr record

Posted by Paul Rosen <pa...@performantsoftware.com>.
Hi Eric,

I think I understand what you are saying but I'm not sure how it would work.

I think you are suggesting two different indexes, each holding the 
same documents, where one has the hard-to-get fields and the other has 
the easy-to-get fields. Then I would make the same query twice, once to 
each index.

So, let's say I'm looking for all documents that contain the word "poem" 
and I want to initially display the 10 most relevant matches. I 
think I'd have to ask each index for its 10 most relevant matches, then 
merge them myself, and display the appropriate ones.

Well, the same document could appear in both lists so I'd have to get 
rid of duplicates. Also, wouldn't the relevancy of the duplicate doc go 
up? But I wouldn't know by how much.

That's the first problem, but then what if the user wants to see page 2? 
I certainly wouldn't query for documents #10-19 on each server.
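
To make the paging problem concrete, here is a minimal sketch of merging hits from two indexes on the client side. Everything in it is an assumption for illustration: the function name, the (id, score) tuples, and the choice to keep the higher score for a duplicate. Scores from two separate indexes are not necessarily comparable, so this shows the mechanics only, not a sound relevancy merge.

```python
def merged_page(hits_a, hits_b, page, page_size=10):
    """Merge two ranked hit lists of (id, score) and return one page of ids.

    To build page N correctly, each list has to hold at least the top
    (N + 1) * page_size hits of its index; asking each server only for
    hits #10-19 would not be enough for page 2.
    """
    best = {}
    for doc_id, score in hits_a + hits_b:
        # A document found in both lists keeps its higher score
        # (an arbitrary choice made for this sketch).
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Highest score first; ties broken by id so the order is stable.
    ranked = sorted(best, key=lambda d: (-best[d], d))
    start = page * page_size
    return ranked[start:start + page_size]


# Hypothetical hits from the two indexes
hits_a = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
hits_b = [("b", 2.5), ("d", 0.5)]
first_page = merged_page(hits_a, hits_b, page=0, page_size=2)
second_page = merged_page(hits_a, hits_b, page=1, page_size=2)
```

The cost grows with the page number, since every deeper page needs a longer prefix from both servers.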

Eric Pugh wrote:
> Right...  You know, if some of your data needs to be updated frequently,
> but the rest is updated once per year and is a really massive dataset,
> then maybe split it up into separate cores?  Since you mentioned
> that you can't get the raw data again, you could just duplicate your
> existing index by doing a filesystem copy.  Leave that copy alone so you
> don't update it and lose your data, and start a new core that you can
> update, ignoring the fact that it has all the website data in it.  Then
> tie the two cores' data sets together outside of Solr.
> 
> Eric


Re: Updating a solr record

Posted by Eric Pugh <ep...@opensourceconnections.com>.
Right...  You know, if some of your data needs to be updated frequently,
but the rest is updated once per year and is a really massive dataset,
then maybe split it up into separate cores?  Since you mentioned
that you can't get the raw data again, you could just duplicate your
existing index by doing a filesystem copy.  Leave that copy alone so you
don't update it and lose your data, and start a new core that you can
update, ignoring the fact that it has all the website data in it.  Then
tie the two cores' data sets together outside of Solr.

Eric



On Thu, Aug 27, 2009 at 1:46 PM, Paul Tomblin<pt...@xcski.com> wrote:
> On Thu, Aug 27, 2009 at 1:27 PM, Eric
> Pugh<ep...@opensourceconnections.com> wrote:
>> You can just query Solr, find the records that you want (including all
>> the website data).  Update them, and then send the entire record back.
>>
>
> Correct me if I'm wrong, but I think you'd end up losing the fields
> that are indexed but not stored.
>
>
> --
> http://www.linkedin.com/in/paultomblin
>

Re: Updating a solr record

Posted by Paul Tomblin <pt...@xcski.com>.
On Thu, Aug 27, 2009 at 1:27 PM, Eric
Pugh<ep...@opensourceconnections.com> wrote:
> You can just query Solr, find the records that you want (including all
> the website data).  Update them, and then send the entire record back.
>

Correct me if I'm wrong, but I think you'd end up losing the fields
that are indexed but not stored.
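
One way to check this ahead of time is to look through schema.xml for fields that are indexed but not stored, since those cannot be read back from a query result. The sketch below is only a rough check under stated assumptions: the schema snippet and field names are made up, it reads only attributes set directly on each <field> element, it treats a missing attribute as "true", and it ignores defaults inherited from the field type as well as dynamic fields.

```python
import xml.etree.ElementTree as ET


def fields_lost_on_round_trip(schema_xml):
    """Return the names of <field> entries that are indexed but not stored."""
    root = ET.fromstring(schema_xml)
    lost = []
    for field in root.iter("field"):
        # Assumption: a missing attribute counts as "true"; defaults set on
        # the fieldType are not consulted in this sketch.
        indexed = field.get("indexed", "true") == "true"
        stored = field.get("stored", "true") == "true"
        if indexed and not stored:
            lost.append(field.get("name"))
    return lost


# Hypothetical schema fragment, not taken from the thread
SCHEMA = """
<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="text" type="text" indexed="true" stored="false"/>
  </fields>
</schema>
"""

lost = fields_lost_on_round_trip(SCHEMA)
```

Any field this reports would come back empty after a query-and-resend update unless its value can be rebuilt from somewhere else.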


-- 
http://www.linkedin.com/in/paultomblin

Re: Updating a solr record

Posted by Eric Pugh <ep...@opensourceconnections.com>.
Do you have to "reindex"?  Do you mean an optimize operation?  You
can do an "update" by just sending Solr a new record, and letting Solr
deal with the removing and adding of the data.

You can just query Solr, find the records that you want (including all
the website data).  Update them, and then send the entire record back.
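
A minimal sketch of the "send the entire record back" half, assuming the stored fields have already been fetched from a select query into a plain dict. The field names and values are hypothetical. The function only builds the <add><doc> XML message that Solr's update handler accepts; posting it to the update URL and sending a commit are left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Turn a dict of stored field values into a Solr <add><doc> message.

    Multi-valued fields are passed as lists; each value gets its own <field>.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record as it might come back from a query
stored = {"id": "doc-1", "title": "A poem", "url": "http://example.com/poem"}
stored["title"] = "A short poem"  # the small tweak
update_xml = build_add_xml(stored)
```

Because the document is re-added under the same unique key, Solr replaces the old copy, so only fields present in the dict survive the update.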

Or am I missing something?  Are these documents so huge that you don't
want to pull back an entire record for some reason?

Eric


Re: Updating a solr record

Posted by Uri Boness <ub...@gmail.com>.
I guess if you have stored="true" then there is no problem.

> 2. If you don't use stored="true" you can still get access to term vectors,
> which you can probably reuse to create a fake field with the same term
> vector in an updated document... just an idea, maybe I am wrong...
Reconstructing the field value from a term enum might work... of 
course the value won't be the same as the original value, but when 
indexed, if you don't have any really special filters (e.g. shingle 
filter), most likely the tokens will be re-indexed as they are (that is, 
the filters will most likely not have any effect). Just make sure to 
take the position increments into account! For example, if you have a 
synonym filter set up, then you'll need to choose only one term at each 
position (otherwise the term frequency of the document will 
increase on every update).
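
The "one term per position" rule can be sketched as follows. This assumes the (term, position) pairs have already been read out of the index somehow, which is not shown here; the function name and sample data are made up. It keeps the first term seen at each position, so it cannot tell the original token from an injected synonym, and gaps left by removed stop words are not preserved.

```python
def rebuild_text(term_positions):
    """Rebuild an approximate field value from (term, position) pairs.

    Only the first term seen at each position is kept, so synonyms that
    share a position are not emitted twice.
    """
    by_position = {}
    for term, position in term_positions:
        by_position.setdefault(position, term)
    return " ".join(by_position[p] for p in sorted(by_position))


# Hypothetical terms; "fast" is a synonym indexed at the same position as "quick"
pairs = [("quick", 0), ("fast", 0), ("brown", 1), ("fox", 2)]
text = rebuild_text(pairs)  # "quick brown fox"
```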

Uri

Fuad Efendi wrote:
> I haven't read all messages in this thread yet, but I probably have an
> answer to some questions...
>
> 1. You want to change schema.xml and reindex, but you don't have access
> to the source documents (stored somewhere on the Internet). But you
> probably use stored="true" in your schema. Then use Solr as your storage
> device: use id:[* TO *] to retrieve the documents from Solr and reindex
> them into another Solr schema...
>
> 2. If you don't use stored="true" you can still get access to term vectors,
> which you can probably reuse to create a fake field with the same term
> vector in an updated document... just an idea, maybe I am wrong...

RE: Updating a solr record

Posted by Fuad Efendi <fu...@efendi.ca>.
I haven't read all messages in this thread yet, but I probably have an
answer to some questions...

1. You want to change schema.xml and reindex, but you don't have access
to the source documents (stored somewhere on the Internet). But you
probably use stored="true" in your schema. Then use Solr as your storage
device: use id:[* TO *] to retrieve the documents from Solr and reindex
them into another Solr schema...

2. If you don't use stored="true" you can still get access to term vectors,
which you can probably reuse to create a fake field with the same term
vector in an updated document... just an idea, maybe I am wrong...
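
Point 1 could be sketched roughly like this: walk the whole index in pages with the start and rows parameters and feed each page into the new core. The base URL, page size, and document count below are assumptions for illustration; the code only builds the select URLs and does not contact a server.

```python
from urllib.parse import urlencode


def page_urls(base_url, total_docs, rows=100):
    """Yield select URLs that walk the whole index in pages of `rows` docs."""
    for start in range(0, total_docs, rows):
        # q=id:[* TO *] matches every document that has an id;
        # fl=* asks for all stored fields.
        params = {"q": "id:[* TO *]", "fl": "*", "start": start, "rows": rows}
        yield base_url + "/select?" + urlencode(params)


# Hypothetical server location and document count
urls = list(page_urls("http://localhost:8983/solr", total_docs=250, rows=100))
```

Only stored fields come back this way, and large start offsets tend to get slower, so this is a sketch for a one-off export into a separate core rather than something to run routinely.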

