Posted to java-user@lucene.apache.org by Antony Bowesman <ad...@teamware.com> on 2009/05/05 12:45:25 UTC

How to not overwrite a Document if it 'already exists'?

I'm adding Documents in batches to an index with IndexWriter.  In certain 
circumstances, I do not want to add the Document if it already exists, where 
existence is determined by field id=myId.

Is there any way to do this with IndexWriter or do I have to open a reader and 
look for the term id:XXX?  Given that opening a reader is expensive, is there 
any way to do this efficiently?

I guess what I want is

IndexWriter.addDocumentIfMissing(Term term, Document doc, Analyzer analyzer)

Thanks
Antony

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to not overwrite a Document if it 'already exists'?

Posted by Antony Bowesman <ad...@teamware.com>.

>> Thanks for that info.  These indexes will be large, in the 10s of millions.
>>  id field is unique and is 29 bytes.  I guess that's still a lot of data to
>> trawl through to get to the term.
> 
> Have you tested how long it takes to look up docs from your id?

Not in indexes that size in a live environment, as I don't have the hardware to 
run those sorts of tests :( although I know that, in general, lookup is fast.

> Couldn't you just give the base & full docs different ids?  Then you
> can independently choose which one to update?

I considered that, but the normal case will not need to worry about this 
scenario.

There is only ever one instance of a mail Doc, whether it is a root mail or part 
of a forward chain, and a root mail can of course become part of a forward chain 
at some point. So it should be optimal to fetch the one Document for the mail Id 
directly, without first trying the true Id and then some pseudo Id if it isn't 
found.

Unfortunately, I'm having to solve this problem in my Lucene app as the tool 
that's generating this data is unable to know what has or has not been handled 
previously.

I'm implementing it using the IndexReader approach for now and will try to get 
some performance data, so thanks for your comments Mike.

Antony

Re: How to not overwrite a Document if it 'already exists'?

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, May 5, 2009 at 7:24 PM, Antony Bowesman <ad...@teamware.com> wrote:
> Michael McCandless wrote:
>>
>> Lucene doesn't provide any way to do this, except opening a reader.
>>
>> Opening a reader is not "that" expensive if you use it for this
>> purpose.  EG neither norms nor FieldCache will be loaded if you just
>> enumerate the term docs.
>
> Thanks for that info.  These indexes will be large, in the 10s of millions.
>  id field is unique and is 29 bytes.  I guess that's still a lot of data to
> trawl through to get to the term.

Have you tested how long it takes to look up docs from your id?

>> But, you can let Lucene do the same thing for you by just always using
>> updateDocument, which'll remove the old doc if it's present.
>
> That's precisely what I don't want to occur.  I have two forms of a
> Document, which represent mail items.  One 'full' version containing all
> index and stored data, which represents a searchable mail item and one
> 'base', which is simply a marker Document which represents a mail in a
> forwarded mail chain, with just a couple of stored fields containing the
> mail meta data.
>
> Under normal circumstances there are no problems as mails arrive in sequence
> and are never handled twice, but there is one case, during a reindex op,
> when the arrival of those mails can come out of sequence, i.e. a full mail
> is indexed first, but that mail is later processed as part of a forwarded
> mail chain of another mail.
>
> It is the second time that mail is handled as a base mail that I do not want
> it to overwrite the full version.
>
> Would it be technically difficult to support something like this in the
> IndexWriter API and if not, would it end up being more efficient than using
> a reader/terms to check this?

Couldn't you just give the base & full docs different ids?  Then you
can independently choose which one to update?

Mike


Re: How to not overwrite a Document if it 'already exists'?

Posted by Antony Bowesman <ad...@teamware.com>.

Michael McCandless wrote:
> Lucene doesn't provide any way to do this, except opening a reader.
> 
> Opening a reader is not "that" expensive if you use it for this
> purpose.  EG neither norms nor FieldCache will be loaded if you just
> enumerate the term docs.

Thanks for that info.  These indexes will be large, in the 10s of millions.  id 
field is unique and is 29 bytes.  I guess that's still a lot of data to trawl 
through to get to the term.

> But, you can let Lucene do the same thing for you by just always using
> updateDocument, which'll remove the old doc if it's present.

That's precisely what I don't want to occur.  I have two forms of a Document, 
which represent mail items.  One 'full' version containing all index and stored 
data, which represents a searchable mail item and one 'base', which is simply a 
marker Document which represents a mail in a forwarded mail chain, with just a 
couple of stored fields containing the mail meta data.

Under normal circumstances there are no problems as mails arrive in sequence and 
are never handled twice, but there is one case, during a reindex op, when the 
arrival of those mails can come out of sequence, i.e. a full mail is indexed 
first, but that mail is later processed as part of a forwarded mail chain of 
another mail.

It is the second time that mail is handled as a base mail that I do not want it 
to overwrite the full version.

Would it be technically difficult to support something like this in the 
IndexWriter API and if not, would it end up being more efficient than using a 
reader/terms to check this?

Antony
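
The rule described above, that a 'base' marker must never replace a 'full'
mail, can be written as a small decision function. This is only a sketch under
assumptions: the Kind and Action enums, the decide and apply methods and the
in-memory map standing in for the index are all hypothetical names, and
treating every other combination as a replacement is a guess rather than
something stated in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a HashMap stands in for the index; no Lucene classes are used.
final class MailOverwriteSketch {

    // Hypothetical marker for the two document forms described in the thread.
    enum Kind { FULL, BASE }

    // What the indexing code should do with an incoming document.
    enum Action { ADD, REPLACE, SKIP }

    // existing == null means no document with this id is in the index yet.
    static Action decide(Kind existing, Kind incoming) {
        if (existing == null) {
            return Action.ADD;
        }
        // The case from the thread: a base mail must not overwrite a full one.
        if (existing == Kind.FULL && incoming == Kind.BASE) {
            return Action.SKIP;
        }
        // Assumption: every other combination replaces the existing document.
        return Action.REPLACE;
    }

    // Applies the decision to the stand-in index and reports what was done.
    static Action apply(Map<String, Kind> index, String id, Kind incoming) {
        Action action = decide(index.get(id), incoming);
        if (action != Action.SKIP) {
            index.put(id, incoming);
        }
        return action;
    }

    public static void main(String[] args) {
        Map<String, Kind> index = new HashMap<>();
        // Out-of-sequence arrival: the full mail first, then the same mail as base.
        System.out.println(apply(index, "mail-1", Kind.FULL));
        System.out.println(apply(index, "mail-1", Kind.BASE));
        System.out.println(index.get("mail-1"));
    }
}
```

Against a real index, the REPLACE branch would correspond to
IndexWriter.updateDocument, and finding the kind of the existing document
would need some marker the application stores itself.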


Re: How to not overwrite a Document if it 'already exists'?

Posted by Michael McCandless <lu...@mikemccandless.com>.

Lucene doesn't provide any way to do this, except opening a reader.

Opening a reader is not "that" expensive if you use it for this
purpose.  EG neither norms nor FieldCache will be loaded if you just
enumerate the term docs.

But, you can let Lucene do the same thing for you by just always using
updateDocument, which'll remove the old doc if it's present.

Mike
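
A minimal sketch of the check-then-add logic discussed here; it is not code
from the thread and uses no Lucene classes. It assumes the lookup against an
open reader is exposed as a simple predicate on the id (existsInSnapshot is a
hypothetical name) and models documents with a hypothetical Doc class. It also
assumes that a reader opened before the batch does not see documents added
during the batch, so ids accepted earlier in the same batch are tracked
separately.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Sketch only: the index lookup is abstracted as a predicate, not a Lucene reader.
final class AddIfMissingSketch {

    // Hypothetical stand-in for a document: just an id and a body.
    static final class Doc {
        final String id;
        final String body;

        Doc(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    // Returns the documents of the batch whose id is neither in the index
    // snapshot nor already accepted earlier in the same batch.
    static List<Doc> selectMissing(List<Doc> batch, Predicate<String> existsInSnapshot) {
        Set<String> pendingIds = new HashSet<>();
        List<Doc> toAdd = new ArrayList<>();
        for (Doc doc : batch) {
            if (existsInSnapshot.test(doc.id)) {
                continue; // already indexed: keep the existing document
            }
            if (!pendingIds.add(doc.id)) {
                continue; // same id seen earlier in this batch
            }
            toAdd.add(doc);
        }
        return toAdd;
    }

    public static void main(String[] args) {
        // "mail-1" is already indexed; "mail-2" appears twice in the batch.
        Set<String> indexed = new HashSet<>(Arrays.asList("mail-1"));
        List<Doc> batch = Arrays.asList(
                new Doc("mail-1", "base"),
                new Doc("mail-2", "full"),
                new Doc("mail-2", "base"));
        for (Doc doc : selectMissing(batch, indexed::contains)) {
            System.out.println(doc.id + " (" + doc.body + ")");
        }
    }
}
```

In a real index the predicate would be backed by a term lookup on the id
field, and the selected documents would then go to IndexWriter.addDocument
instead of updateDocument, so existing documents are left untouched.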

On Tue, May 5, 2009 at 6:45 AM, Antony Bowesman <ad...@teamware.com> wrote:
> I'm adding Documents in batches to an index with IndexWriter.  In certain
> circumstances, I do not want to add the Document if it already exists, where
> existence is determined by field id=myId.
>
> Is there any way to do this with IndexWriter or do I have to open a reader
> and look for the term id:XXX?  Given that opening a reader is expensive, is
> there any way to do this efficiently?
>
> I guess what I want is
>
> IndexWriter.addDocumentIfMissing(Term term, Document doc, Analyzer analyzer)
>
> Thanks
> Antony