Posted to java-user@lucene.apache.org by jacky <ja...@gmail.com> on 2006/09/08 08:23:02 UTC
duplicate fields
hi,
1. Is there an efficient way to check whether the same field value (the field holds a unique ID) already exists when adding a document to a Lucene index? Should I search on this field?
2. Is there an efficient way to check whether duplicate values of this field (which holds a unique ID) exist in the Lucene index?
I can think of two methods: read all documents and compare the fields, or search for each field value. Is there a better one?
Thanks for your help!
Best Regards.
jacky
Re: duplicate fields
Posted by Erick Erickson <er...@gmail.com>.
I'm not at all sure what you're asking.
I believe you can use a TermEnum with an empty term ("") to get all the
terms in a particular field.
If you're asking "how can I find all the fields in a document", well, that's
tricky. Since there's no requirement that every document have the same
fields, there's no way that I know of to ask "what are the names of all
the fields in the index". You have to "just know".
That said, given a document, you *can* enumerate all the fields in that
document. But that doesn't tell you anything about the fields of the *next*
document.
If none of this is relevant, could you give more details about what you're
trying to do?
Best
Erick
On 9/11/06, vinay kumar <vp...@osi-tech.com> wrote:
>
> Does anyone know how to get the unique fields from the field in the Lucene
> index?
duplicate fields
Posted by vinay kumar <vp...@osi-tech.com>.
Does anyone know how to get the unique fields from the field in the Lucene
index?
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: duplicate fields
Posted by Daniel Noll <da...@nuix.com.au>.
jacky wrote:
> hi Daniel,
> How do you use a separate database to check for duplicate fields? That sounds interesting!
It's simple enough. Every time we're about to process a new item we
look in the database to see if there is already an item with the same
ID. If there isn't, we add the row. If there is, it's a duplicate.
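A minimal sketch of that check-then-add step, for illustration only. The thread does not say which database or schema is used, so the `SeenIds` class and its `addIfNew` method are hypothetical, and an in-memory `HashSet` stands in for the database table (a real setup would more likely use a table with a unique key on the ID column).

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical stand-in for the external ID store described above.
 * An in-memory set plays the role of the database table.
 */
public class SeenIds {
    private final Set<String> ids = new HashSet<>();

    /** Records the ID and returns true if it is new; returns false for a duplicate. */
    public boolean addIfNew(String id) {
        // Set.add returns false when the element is already present,
        // so the lookup and the insert happen in one call.
        return ids.add(id);
    }

    public static void main(String[] args) {
        SeenIds seen = new SeenIds();
        for (String id : new String[] {"a1", "b2", "a1"}) {
            if (seen.addIfNew(id)) {
                System.out.println("index " + id);          // new item: add it to the index
            } else {
                System.out.println("skip duplicate " + id); // already seen
            }
        }
    }
}
```

Doing the lookup and the insert as one operation avoids a window in which two items with the same ID could both pass the check.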
Daniel
--
Daniel Noll
Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/ Fax: +61 2 9212 6902
This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.
Re: duplicate fields
Posted by jacky <ja...@gmail.com>.
hi Daniel,
How do you use a separate database to check for duplicate fields? That sounds interesting!
Best Regards.
jacky
----- Original Message -----
From: "Daniel Noll" <da...@nuix.com.au>
To: <ja...@lucene.apache.org>
Sent: Friday, September 08, 2006 3:08 PM
Subject: Re: duplicate fields
> jacky wrote:
> > hi, 1. Is there an efficient way to check whether the same field
> > value (the field holds a unique ID) already exists when adding a
> > document to a Lucene index? Should I search on this field?
>
> One way is to create an IndexReader and IndexSearcher on your index,
> which you reopen every now and then. But we do this task by using a
> separate database, for the sake of efficiency.
>
> > 2. Is there an efficient way to check whether duplicate values of
> > this field (which holds a unique ID) exist in the Lucene index? I can
> > think of two methods: read all documents and compare the fields, or
> > search for each field value. Is there a better one?
>
> The simplest way without using an external database is to use the
> termDocs enumeration. For each term you can easily see which ones have
> multiple documents, so every document other than the first for each term
> is a duplicate (which you could then use to build a filter to remove
> duplicates.)
>
> Daniel
Re: duplicate fields
Posted by Daniel Noll <da...@nuix.com.au>.
jacky wrote:
> hi, 1. Is there an efficient way to check whether the same field
> value (the field holds a unique ID) already exists when adding a
> document to a Lucene index? Should I search on this field?
One way is to create an IndexReader and IndexSearcher on your index,
which you reopen every now and then. But we do this task by using a
separate database, for the sake of efficiency.
> 2. Is there an efficient way to check whether duplicate values of
> this field (which holds a unique ID) exist in the Lucene index? I can
> think of two methods: read all documents and compare the fields, or
> search for each field value. Is there a better one?
The simplest way without using an external database is to use the
termDocs enumeration. For each term you can easily see which ones have
multiple documents, so every document other than the first for each term
is a duplicate (which you could then use to build a filter to remove
duplicates.)
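The per-term idea above can be sketched in plain Java. This block does not call Lucene: the array `idsByDoc` and the class `DuplicateDocs` are hypothetical stand-ins for the index, and the map models the term-to-documents listing that a walk over `IndexReader.terms()` and `IndexReader.termDocs(Term)` would expose in the Lucene API of that time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DuplicateDocs {
    /**
     * idsByDoc[i] is the unique-ID field value of document number i
     * (an illustrative stand-in for the index contents).
     * Returns the document numbers that repeat an ID already used by
     * an earlier document.
     */
    public static List<Integer> findDuplicates(String[] idsByDoc) {
        // Build term -> list of document numbers, in document order.
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (int doc = 0; doc < idsByDoc.length; doc++) {
            postings.computeIfAbsent(idsByDoc[doc], k -> new ArrayList<>()).add(doc);
        }
        List<Integer> duplicates = new ArrayList<>();
        for (List<Integer> docs : postings.values()) {
            // Keep the first document for each term; every later one is a duplicate.
            duplicates.addAll(docs.subList(1, docs.size()));
        }
        return duplicates;
    }

    public static void main(String[] args) {
        String[] ids = {"a", "b", "a", "c", "b", "a"};
        System.out.println(findDuplicates(ids)); // prints [2, 5, 4]
    }
}
```

The returned document numbers are the ones a filter would exclude, or the ones to delete from the index.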
Daniel