Posted to java-user@lucene.apache.org by Niels Ott <no...@sfs.uni-tuebingen.de> on 2008/11/28 21:31:05 UTC

Deleting from Index by URL field: is it safe?

Hi all,

I want to safely delete documents from my index. There is a URL field 
that specifies where the document came from.

I'm using something like this:

    indexwriter.deleteDocuments(new Term("URL", myURL));

(inspired by the Lucene in Action Book, page 35.)

I'm uncertain whether this is safe or not: is there a chance that I 
delete documents I would want to keep? How exactly does the matching work?

During indexing, I'm using a KeywordAnalyzer for the URL field in order 
to avoid tokenization.

Best,

    Niels

-- 
Niels Ott
Computational Linguist (B.A.)
http://www.drni.de/niels/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Deleting from Index by URL field: is it safe?

Posted by Niels Ott <no...@sfs.uni-tuebingen.de>.

Hi all,

German Kondolf wrote:
> It works exactly as it does when you search for that term.
> 
> Check how the field is indexed: if you index it without analyzing it
> (Index.UN_TOKENIZED), the delete will match a document only when its
> URL field is exactly the URL you pass in.

Is that also true if I simply use the KeywordAnalyzer?

The reason why I want to do it this way is that I have a special 
Analyzer that encapsulates the "knowledge" on how to treat each field. 
In a way something like the PerFieldAnalyzerWrapper but more 
specialized. I want to use the very same Analyzer for querying as well, 
so it appears to me that it is good to have the "knowledge" about the 
treatment of fields in that single place.

> It's possible that the URL is not unique enough in your domain. Is there
> no other unique identifier that you could use?

I think the URL is unique enough for my cases. The system is still a 
prototype so I can change that later, if it turns out that it doesn't do 
the job for me.

> I suggest you create a test and try it on a RAMDirectory and see exactly
> what happens and what you want!

This looks like a good idea to me. Thank you for the hint.

Best,

    Niels

-- 
Niels Ott
Computational Linguist (B.A.)
http://www.drni.de/niels/

Re: Deleting from Index by URL field: is it safe?

Posted by German Kondolf <ge...@gmail.com>.

It works exactly as it does when you search for that term.

Check how the field is indexed: if you index it without analyzing it
(Index.UN_TOKENIZED), the delete will match a document only when its
URL field is exactly the URL you pass in.
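
To make the matching rule concrete, here is a minimal sketch in plain Java.
It is a toy model, not Lucene code: the class TermDeleteSketch and all of
its methods are made up for illustration. It assumes only what is said
above: a delete by term removes the documents whose field contains exactly
that term, so the outcome depends on which terms were written at indexing
time. addKeyword stands in for a keyword-style analyzer (whole value as one
term), addTokenized for a tokenizing analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of delete-by-term. NOT Lucene; every name here is hypothetical.
class TermDeleteSketch {
    // Maps (field, term text) to the IDs of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Set<Integer> liveDocs = new HashSet<>();
    private int nextDocId = 0;

    private static String key(String field, String term) {
        return field + '\u0000' + term;
    }

    // Index the whole value as a single term (keyword-style analysis).
    int addKeyword(String field, String value) {
        return addTerms(field, List.of(value));
    }

    // Index the value as lowercase alphanumeric tokens (stand-in for a tokenizing analyzer).
    int addTokenized(String field, String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return addTerms(field, tokens);
    }

    private int addTerms(String field, List<String> terms) {
        int id = nextDocId++;
        liveDocs.add(id);
        for (String term : terms) {
            postings.computeIfAbsent(key(field, term), k -> new HashSet<>()).add(id);
        }
        return id;
    }

    // Delete every live document whose field contains exactly this term.
    // Returns the number of documents removed.
    int deleteByTerm(String field, String text) {
        int deleted = 0;
        for (Integer id : postings.getOrDefault(key(field, text), Set.of())) {
            if (liveDocs.remove(id)) {
                deleted++;
            }
        }
        return deleted;
    }

    int numDocs() {
        return liveDocs.size();
    }

    public static void main(String[] args) {
        String a = "http://example.org/a.html";
        String b = "http://example.org/a.html?page=2";

        // Keyword-style indexing: only the document with the identical URL goes away.
        TermDeleteSketch keyword = new TermDeleteSketch();
        keyword.addKeyword("URL", a);
        keyword.addKeyword("URL", b);
        System.out.println("keyword, full URL: deleted " + keyword.deleteByTerm("URL", a)
                + ", remaining " + keyword.numDocs());

        // Tokenized indexing: the full URL is not a term, but "example" is in both documents.
        TermDeleteSketch tokenized = new TermDeleteSketch();
        tokenized.addTokenized("URL", a);
        tokenized.addTokenized("URL", b);
        System.out.println("tokenized, full URL: deleted " + tokenized.deleteByTerm("URL", a)
                + ", remaining " + tokenized.numDocs());
        System.out.println("tokenized, \"example\": deleted " + tokenized.deleteByTerm("URL", "example")
                + ", remaining " + tokenized.numDocs());
    }
}
```

In this model the keyword-style index loses exactly one document. In the
tokenized index, deleting by the full URL removes nothing, while deleting
by a single token such as "example" removes both documents. That second
case is the accident to guard against.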

It's possible that the URL is not unique enough in your domain. Is there
no other unique identifier that you could use?

I suggest you create a test and try it on a RAMDirectory and see exactly
what happens and what you want!

Regards,
German K.