Posted to java-user@lucene.apache.org by Mike Baranczak <mb...@twcny.rr.com> on 2005/04/20 01:44:35 UTC
globally unique field
First of all, a big thanks to all the Lucene hackers - I've only been
using your product for a couple of weeks, and I've been very impressed
by what I've seen.
Here's my question: I have an index with a little over 3 million
documents in it, with more on the way. Each document has an "URL" field
(which is not indexed). I want to guarantee that each URL is unique;
that is, when I'm adding a new document, I have to check if another
existing document has the same value for the URL field. What's the best
way to do it? I can think of two possible approaches:
1 - Open an IndexReader and iterate over all the Documents that it
contains, checking the value of the "URL" field for each Document. This
seems a little inefficient, since I only care about one field, and I
don't want to have to retrieve all of the fields.
2 - Rebuild the index such that the URL field is indexed. Then, I could
just do a normal search for the value of the URL. But since the URL
field will never be searched under any other circumstances, this seems
like kind of a waste of disk space.
I'm sure somebody else has had to do something like this before. Is
there a better way to do it than what I've described above? If not,
then which of the two approaches will give me the best results?
Thanks in advance.
-MB
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: globally unique field
Posted by Chuck Williams <ch...@allthingslocal.com>.
Mike Baranczak wrote:
> First of all, a big thanks to all the Lucene hackers - I've only been
> using your product for a couple of weeks, and I've been very impressed
> by what I've seen.
>
> Here's my question: I have an index with a little over 3 million
> documents in it, with more on the way. Each document has an "URL"
> field (which is not indexed). I want to guarantee that each URL is
> unique; that is, when I'm adding a new document, I have to check if
> another existing document has the same value for the URL field. What's
> the best way to do it? I can think of two possible approaches:
>
> 1 - Open an IndexReader and iterate over all the Documents that it
> contains, checking the value of the "URL" field for each Document.
> This seems a little inefficient, since I only care about one field,
> and I don't want to have to retrieve all of the fields.
>
> 2 - Rebuild the index such that the URL field is indexed. Then, I
> could just do a normal search for the value of the URL. But since the
> URL field will never be searched under any other circumstances, this
> seems like kind of a waste of disk space.
>
> I'm sure somebody else has had to do something like this before. Is
> there a better way to do it than what I've described above? If not,
> then which of the two approaches will give me the best results?
I'd suggest putting the URL into the index, untokenized, and then using
IndexReader.terms(<Term for your new URL>). This will seek to precisely
the URL you want or the one just after it, and do it quickly. This will
be orders of magnitude faster than approach 1, and much faster than
using the search API if that is what you intended by approach 2. The
disk space should not be noticeable unless your documents are very
small.
Chuck
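
The seek-then-compare idea in the reply above can be sketched without Lucene. The block below is only an illustration, not Lucene code: a java.util.TreeSet stands in for the sorted term dictionary of the untokenized URL field, and the names UniqueUrlCheck, contains and addIfAbsent are hypothetical. It shows why the result of the seek has to be compared with the URL you asked for, since the seek may land on the next term in sort order (or on nothing at all) when the URL is absent. With a real index, the same comparison would presumably be applied to the term the enumeration is positioned on, including a check that it still belongs to the URL field.

```java
import java.util.TreeSet;

/**
 * Stand-in for the sorted term dictionary of an untokenized "URL" field.
 * This is NOT Lucene code; a TreeSet only mimics the ordered-seek behaviour.
 */
public class UniqueUrlCheck {
    // Sorted set of URL terms, playing the role of the index's term dictionary.
    private final TreeSet<String> terms = new TreeSet<>();

    /** Seeks to the given URL or the first term after it, then tests for an exact match. */
    public boolean contains(String url) {
        // ceiling() returns the term itself, the next term in sort order, or null.
        String found = terms.ceiling(url);
        return url.equals(found);
    }

    /** Adds the URL only if it is not already present; returns true if it was added. */
    public boolean addIfAbsent(String url) {
        if (contains(url)) {
            return false;
        }
        terms.add(url);
        return true;
    }

    public static void main(String[] args) {
        UniqueUrlCheck index = new UniqueUrlCheck();
        System.out.println(index.addIfAbsent("http://example.com/a")); // true
        System.out.println(index.addIfAbsent("http://example.com/c")); // true
        // The seek for ".../b" lands on ".../c", which is not an exact match.
        System.out.println(index.contains("http://example.com/b"));    // false
        System.out.println(index.addIfAbsent("http://example.com/a")); // false
    }
}
```

The lookup cost in this sketch grows with the logarithm of the number of URLs rather than with the number of documents, which is the reason a seek on a sorted term list beats iterating over every stored document as in approach 1.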