Posted to java-user@lucene.apache.org by John Patterson <jd...@gmail.com> on 2008/08/26 09:56:13 UTC

Deleted document terms

Hi,

I just discovered some strange behaviour with deleted documents.  I do a
search for documents with a certain query and delete one using
IndexWriter.deleteDocuments(Term) using a key for the term.  Then I repeat
the search and the document is still there because I use a custom
HitCollector which does not check IndexReader.isDeleted(int).  That is all
expected.  

But when I try to show the deleted document by searching by key using the
same term it was deleted with, it is not found.  So it seems that the term
(id:MYKEY) is removed from the index.

So I was surprised that the term for the id was removed but not the other
terms for the document.

But I guess this makes sense and I just need to check
IndexReader.isDeleted().

Does this all sound like correct behaviour?

Thanks,

John
-- 
View this message in context: http://www.nabble.com/Deleted-document-terms-tp19157027p19157027.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Deleted document terms

Posted by Michael McCandless <lu...@mikemccandless.com>.
Normally an ID should be indexed as Field.Index.UN_TOKENIZED.
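
A minimal sketch of why this matters, assuming the id field was run through
a lowercasing analyzer (the thread does not say which analyzer was used).
IndexWriter.deleteDocuments(Term) matches the term text exactly and does not
analyze it, so deleting by Term("id", "MYKEY") finds nothing if the indexed
term is "mykey". ToyTermIndex is a hypothetical stand-in written for
illustration; it is not Lucene's API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy stand-in for an index (NOT Lucene's API): just enough to show exact-term deletes. */
class ToyTermIndex {
    // Term text of the "id" field -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new HashMap<>();
    private int nextDocId = 0;

    /** Adds a document. When tokenized, the key is lowercased, as a lowercasing analyzer would do. */
    int addDocument(String key, boolean tokenized) {
        String indexedTerm = tokenized ? key.toLowerCase(Locale.ROOT) : key;
        int docId = nextDocId++;
        postings.computeIfAbsent(indexedTerm, t -> new ArrayList<>()).add(docId);
        return docId;
    }

    /** Deletes by exact term text, with no analysis. Returns how many documents matched. */
    int deleteByTerm(String termText) {
        List<Integer> removed = postings.remove(termText);
        return removed == null ? 0 : removed.size();
    }

    public static void main(String[] args) {
        ToyTermIndex analyzed = new ToyTermIndex();
        analyzed.addDocument("MYKEY", true);   // indexed as "mykey"
        System.out.println("tokenized id, documents deleted: " + analyzed.deleteByTerm("MYKEY"));

        ToyTermIndex verbatim = new ToyTermIndex();
        verbatim.addDocument("MYKEY", false);  // indexed as "MYKEY"
        System.out.println("untokenized id, documents deleted: " + verbatim.deleteByTerm("MYKEY"));
    }
}
```

In this toy model the delete misses the tokenized id and hits the
untokenized one, which is the usual reason for keeping key fields
untokenized.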

Mike

John Patterson wrote:

> That was the problem - the id was not tokenized.  Thanks for your help.


Re: Deleted document terms

Posted by John Patterson <jd...@gmail.com>.
That was the problem - the id was not tokenized.  Thanks for your help.


Kalani Ruwanpathirana wrote:
> 
> Hi John,
> 
> Are you sure you made the id "tokenized" while indexing? I could overcome
> this issue by having a tokenized field, which was used for the deletion as
> below.
> 
> document.add(new Field("id", id, Field.Store.YES,
> Field.Index.TOKENIZED));


Re: Deleted document terms

Posted by Kalani Ruwanpathirana <ka...@gmail.com>.
Hi John,

Are you sure you made the id "tokenized" while indexing? I could overcome
this issue by having a tokenized field, which was used for the deletion as
below.

document.add(new Field("id", id, Field.Store.YES, Field.Index.TOKENIZED));



Thanks




-- 
Kalani Ruwanpathirana
Department of Computer Science & Engineering
University of Moratuwa

Re: Deleted document terms

Posted by Michael McCandless <lu...@mikemccandless.com>.

John Patterson wrote:

> I just discovered some strange behaviour with deleted documents.  I do a
> search for documents with a certain query and delete one using
> IndexWriter.deleteDocuments(Term) using a key for the term.  Then I repeat
> the search and the document is still there because I use a custom
> HitCollector which does not check IndexReader.isDeleted(int).  That is all
> expected.

Hmm -- once a document is deleted, your HitCollector won't ever see  
it.  During searching, isDeleted is called per document at a very low  
level.

If your HitCollector is seeing it, it sounds like it wasn't really  
deleted.  Are you sure you closed the IndexWriter and then reopened  
your searcher, so that the searcher will see the deletion?
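
A minimal sketch of the visibility rule described above, as a toy model
rather than Lucene code. ToySnapshotIndex and its methods are hypothetical
names invented for illustration. The assumptions are that deletes stay
buffered in the writer until it is closed, and that a searcher sees only the
state of the index at the moment it was opened.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model (NOT Lucene's API): a writer that buffers deletes and searchers that read a snapshot. */
class ToySnapshotIndex {
    // Keys visible to any searcher opened from now on.
    private final Set<String> committed = new HashSet<>();
    // Deletes requested through the writer but not yet applied.
    private final Set<String> pendingDeletes = new HashSet<>();

    /** Simplification: added keys become visible to new searchers immediately. */
    void add(String key) {
        committed.add(key);
    }

    /** Buffers a delete; nothing changes for searchers yet. */
    void delete(String key) {
        pendingDeletes.add(key);
    }

    /** Closing the writer applies the buffered deletes. */
    void closeWriter() {
        committed.removeAll(pendingDeletes);
        pendingDeletes.clear();
    }

    /** Opening a searcher copies the committed state, so later changes do not reach it. */
    Set<String> openSearcher() {
        return new HashSet<>(committed);
    }
}
```

In this model a searcher opened before closeWriter() keeps returning the
deleted key, and only a searcher opened afterwards stops seeing it.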

> But when I try to show the deleted document by searching by key using the
> same term it was deleted with, it is not found.  So it seems that the term
> (id:MYKEY) is removed from the index.

This is odd -- the document should either be deleted (entirely), or  
not.  You shouldn't get different behavior if you search for the doc  
one way vs another.

> So I was surprised that the term for the id was removed but not the other
> terms for the document.

That makes two of us!

Mike
