You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Mike Baranczak <mb...@twcny.rr.com> on 2005/04/20 01:44:35 UTC

globally unique field

First of all, a big thanks to all the Lucene hackers - I've only been 
using your product for a couple of weeks, and I've been very impressed 
by what I've seen.

Here's my question: I have an index with a little over 3 million 
documents in it, with more on the way. Each document has an "URL" field 
(which is not indexed). I want to guarantee that each URL is unique; 
that is, when I'm adding a new document, I have to check if another 
existing document has the same value for the URL field. What's the best 
way to do it? I can think of two possible approaches:

1 - Open an IndexReader and iterate over all the Documents that it 
contains, checking the value of the "URL" field for each Document. This 
seems a little inefficient, since I only care about one field, and I 
don't want to have to retrieve all of the fields.

2 - Rebuild the index such that the URL field is indexed. Then, I could 
just do a normal search for the value of the URL. But since the URL 
field will never be searched under any other circumstances, this seems 
like kind of a waste of disk space.

I'm sure somebody else has had to do something like this before. Is 
there a better way to do it than what I've described above? If not, 
then which of the two approaches will give me the best results?

Thanks in advance.

-MB


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: globally unique field

Posted by Chuck Williams <ch...@allthingslocal.com>.

Mike Baranczak wrote:

> First of all, a big thanks to all the Lucene hackers - I've only been 
> using your product for a couple of weeks, and I've been very impressed 
> by what I've seen.
>
> Here's my question: I have an index with a little over 3 million 
> documents in it, with more on the way. Each document has an "URL" 
> field (which is not indexed). I want to guarantee that each URL is 
> unique; that is, when I'm adding a new document, I have to check if 
> another existing document has the same value for the URL field. What's 
> the best way to do it? I can think of two possible approaches:
>
> 1 - Open an IndexReader and iterate over all the Documents that it 
> contains, checking the value of the "URL" field for each Document. 
> This seems a little inefficient, since I only care about one field, 
> and I don't want to have to retrieve all of the fields.
>
> 2 - Rebuild the index such that the URL field is indexed. Then, I 
> could just do a normal search for the value of the URL. But since the 
> URL field will never be searched under any other circumstances, this 
> seems like kind of a waste of disk space.
>
> I'm sure somebody else has had to do something like this before. Is 
> there a better way to do it than what I've described above? If not, 
> then which of the two approaches will give me the best results?

I'd suggest putting the URL into the index, untokenzied, and then use 
IndexReader.terms(<Term for you new URL>). This will seek to precisely 
the URL you want or the one just after it, and do it quickly. This will 
be orders of magnitude faster than approach 1, and much faster than 
using the search API if that is what you intended by approach 2. The 
disk space should not be noticeable unless your documents are very very 
small.

Chuck


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org