Posted to solr-user@lucene.apache.org by sabman <sa...@gmail.com> on 2011/12/07 20:49:20 UTC

avoid overwrite in DataImportHandler

I have a unique ID defined for the documents I am indexing. I want to avoid
overwriting the documents that have already been indexed. I am using
XPathEntityProcessor and TikaEntityProcessor to process the documents.

The DataImportHandler does not seem to have an option to set
overwrite=false. I have read on other forums that I should use deduplication
instead, but I don't see how it is related to my problem.

Any help on this (or an explanation of how deduplication would apply to my
problem) would be great. Thanks!

--
View this message in context: http://lucene.472066.n3.nabble.com/avoid-overwrite-in-DataImportHandler-tp3568435p3568435.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: avoid overwrite in DataImportHandler

Posted by "Young, Cody" <Co...@move.com>.

I believe all you need to do is add clean=false to your query string.

If you have a unique key set up as your ID in Solr, then it should update
the existing documents instead of deleting and re-indexing.
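
A minimal sketch of the request described above, assuming a DataImportHandler
registered at /dataimport on a local Solr listening on port 8983. The host,
port, and handler path are placeholders, not taken from this thread, and
build_import_url is a made-up helper name. The command, clean, and commit
parameters are the ones DIH reads from the query string.

```python
from urllib.parse import urlencode

# Assumed values: adjust host, port, and handler path to match your setup.
SOLR_DIH_URL = "http://localhost:8983/solr/dataimport"


def build_import_url(base_url, clean=False, commit=True):
    """Return a DIH full-import URL with the clean flag set explicitly."""
    params = {
        "command": "full-import",
        "clean": str(clean).lower(),    # "false" keeps the existing index
        "commit": str(commit).lower(),  # commit once the import finishes
    }
    return base_url + "?" + urlencode(params)


url = build_import_url(SOLR_DIH_URL)
print(url)
```

Requesting the printed URL (from a browser or any HTTP client) should start a
full import without first deleting the documents already in the index.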

Cody

-----Original Message-----
From: P Williams [mailto:williams.tricia.list@gmail.com] 
Sent: Thursday, December 08, 2011 11:11 AM
To: solr-user@lucene.apache.org
Subject: Re: avoid overwrite in DataImportHandler

Ah.  Thanks Erick.

I see now that my question is different from sabman's.

Is there a way to use the DataImportHandler's "full-import" command so
that it does not delete the existing material before it begins?

Thanks,
Tricia

Re: avoid overwrite in DataImportHandler

Posted by P Williams <wi...@gmail.com>.

Ah.  Thanks Erick.

I see now that my question is different from sabman's.

Is there a way to use the DataImportHandler's "full-import" command so that
it does not delete the existing material before it begins?

Thanks,
Tricia

On Thu, Dec 8, 2011 at 6:35 AM, Erick Erickson <er...@gmail.com> wrote:

> This is all controlled by Solr via the <uniqueKey> field in your schema.
> Just remove that entry.
>
> But then it's all up to you to handle the fact that there will be multiple
> documents with the same ID all returned as a result of querying. And
> it won't matter what program adds data, *nothing* will be overwritten,
> DIH has no part in that decision.
>
> Deduplication is about defining some fields in your record and avoiding
> adding another document if the contents are "close", where close is a
> slippery concept. I don't think it's related to your problem at all.
>
> Best
> Erick

Re: avoid overwrite in DataImportHandler

Posted by Erick Erickson <er...@gmail.com>.

This is all controlled by Solr via the <uniqueKey> field in your schema. Just
remove that entry.

But then it's all up to you to handle the fact that there will be multiple
documents with the same ID, all returned as a result of querying. And
it won't matter what program adds data: *nothing* will be overwritten.
DIH has no part in that decision.
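
As a rough illustration of the behavior described above (a toy model in
Python, not Solr code), an index with a unique key acts like a dictionary
keyed on the ID, while an index without one acts like a plain list. The
function names and sample documents are invented for this sketch.

```python
def add_with_unique_key(index, doc, key="id"):
    """Toy model of an index WITH a unique key: a later add replaces the earlier doc."""
    index[doc[key]] = doc


def add_without_unique_key(index, doc):
    """Toy model of an index WITHOUT a unique key: every add is kept."""
    index.append(doc)


# Two hypothetical documents that share the same ID.
docs = [
    {"id": "doc-1", "title": "first version"},
    {"id": "doc-1", "title": "second version"},
]

keyed_index = {}
for d in docs:
    add_with_unique_key(keyed_index, d)

unkeyed_index = []
for d in docs:
    add_without_unique_key(unkeyed_index, d)

print(len(keyed_index))    # 1: the second add overwrote the first
print(len(unkeyed_index))  # 2: both documents are kept and both would match a query
```

In neither model does the program doing the adding get a say, which matches
the point that the decision is made by the schema and not by DIH.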

Deduplication is about defining some fields in your record and avoiding
adding another document if the contents are "close", where close is a
slippery concept. I don't think it's related to your problem at all.
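
To make the contrast concrete, here is a small sketch of signature-based
deduplication as described above. It is an illustration only: the
normalization rule (lowercase, collapsed whitespace), the field names, and
the helpers signature and add_if_new are assumptions for this example, and
Solr's own deduplication is configured in the update processor chain rather
than written like this.

```python
import hashlib


def signature(doc, fields):
    """Hash the chosen fields after lowercasing and collapsing whitespace."""
    parts = [" ".join(str(doc.get(f, "")).lower().split()) for f in fields]
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def add_if_new(index, seen, doc, fields):
    """Append doc unless a doc with the same signature was already added."""
    sig = signature(doc, fields)
    if sig in seen:
        return False
    seen.add(sig)
    index.append(doc)
    return True


index, seen = [], set()
fields = ["title", "body"]  # hypothetical fields that define "close"
a = {"id": "1", "title": "Solr  Tips", "body": "Use DIH."}
b = {"id": "2", "title": "solr tips", "body": "use dih."}

print(add_if_new(index, seen, a, fields))  # True
print(add_if_new(index, seen, b, fields))  # False: same signature despite a different id
```

The second document is skipped even though its ID differs, which shows why
deduplication answers a different question from "has this ID been indexed
already?".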

Best
Erick

On Wed, Dec 7, 2011 at 3:27 PM, P Williams
<wi...@gmail.com> wrote:
> Hi,
>
> I've wondered the same thing myself.  I feel like the "clean" parameter has
> something to do with it but it doesn't work as I'd expect either.  Thanks
> in advance to anyone who can answer this question.
>
> *clean* : (default 'true'). Tells whether to clean up the index before the
> indexing is started.
>
> Tricia

Re: avoid overwrite in DataImportHandler

Posted by P Williams <wi...@gmail.com>.

Hi,

I've wondered the same thing myself.  I feel like the "clean" parameter has
something to do with it but it doesn't work as I'd expect either.  Thanks
in advance to anyone who can answer this question.

*clean* : (default 'true'). Tells whether to clean up the index before the
indexing is started.

Tricia
