Posted to java-user@lucene.apache.org by jacky <ja...@gmail.com> on 2006/09/08 08:23:02 UTC

duplicate fields

Hi,
   1. Is there an efficient way to check whether a document with the same field value (the field holds a unique ID) already exists when adding a document to a Lucene index? Should I run a search on that field?
   2. Is there an efficient way to check whether the Lucene index already contains documents with duplicate values in that field (the unique ID)?
      I can think of two methods: read all documents and compare the fields, or search for each field value.  Is there a better one?

   Thanks for your help!

     Best Regards.
       jacky  
       

Re: duplicate fields

Posted by Erick Erickson <er...@gmail.com>.
I'm not at all sure what you're asking.

I believe you can use a TermEnum with an empty term ("") to get all the
terms in a particular field.
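
To illustrate why seeking to an empty term works, here is a minimal plain-Java sketch. It does not use Lucene; it assumes only that the term dictionary is sorted by field name and then by term text, so seeking to (field, "") lands on the first term of that field and you read forward until the field changes. The class FieldTermLister, its key() helper, and the "field\u0000text" encoding are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a term dictionary; not a Lucene class.
class FieldTermLister {
    // Encode a term as "field\u0000text" so that natural String order
    // sorts by field first and then by term text.
    static String key(String field, String text) {
        return field + '\u0000' + text;
    }

    // Seek to (field, "") and read forward until the field changes.
    static List<String> termsInField(SortedSet<String> dictionary, String field) {
        List<String> texts = new ArrayList<>();
        String prefix = key(field, "");
        for (String k : dictionary.tailSet(prefix)) {
            if (!k.startsWith(prefix)) {
                break; // reached the first term of the next field
            }
            texts.add(k.substring(prefix.length()));
        }
        return texts;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = new TreeSet<>();
        dictionary.add(key("body", "hello"));
        dictionary.add(key("id", "a1"));
        dictionary.add(key("id", "b2"));
        dictionary.add(key("title", "lucene"));
        System.out.println(termsInField(dictionary, "id")); // prints [a1, b2]
    }
}
```

The point to carry over is the stop condition: the enumeration does not end at the field boundary on its own, so the loop has to check the field of each term it reads.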

If you're asking "how can I find all the fields in a document", well, that's
tricky. Since there's no requirement that every document have the same
fields, I don't know of any way to ask "what are the names of all
the fields in the index". You have to "just know".

That said, given a document, you *can* enumerate all the fields in that
document. But that doesn't tell you anything about the fields of the *next*
document.

If none of this is relevant, could you give more details about what you're
trying to do?

Best
Erick

On 9/11/06, vinay kumar <vp...@osi-tech.com> wrote:
>
> Does anyone know how to get the unique fields from the field in the Lucene
> index?
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

duplicate fields

Posted by vinay kumar <vp...@osi-tech.com>.
Does anyone know how to get the unique fields from the field in the Lucene
index?







Re: duplicate fields

Posted by Daniel Noll <da...@nuix.com.au>.
jacky wrote:
> Hi Daniel,
>    How do you use a separate database to check for duplicate fields?  That sounds interesting!

It's simple enough.  Every time we're about to process a new item we 
look in the database to see if there is already an item with the same 
ID.  If there isn't, we add the row.  If there is, it's a duplicate.
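
A minimal sketch of that check, assuming an in-memory set as a stand-in for the database (the thread does not say which database is used; a real setup would more likely rely on a table with a unique key on the ID column). SeenIdStore and addIfAbsent are names invented for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the separate ID database.
class SeenIdStore {
    private final Set<String> ids = new HashSet<>();

    // Records the ID and returns true if it was not seen before;
    // returns false when the ID is already present, i.e. a duplicate.
    boolean addIfAbsent(String id) {
        return ids.add(id);
    }

    public static void main(String[] args) {
        SeenIdStore store = new SeenIdStore();
        for (String id : new String[] {"a1", "b2", "a1"}) {
            if (store.addIfAbsent(id)) {
                System.out.println("index " + id);          // new item: add it to the index
            } else {
                System.out.println("skip duplicate " + id); // already processed
            }
        }
    }
}
```

Doing the lookup and the insert as one operation keeps the check cheap and avoids a separate "does it exist" query before every add.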

Daniel

-- 
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://www.nuix.com.au/                        Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.



Re: duplicate fields

Posted by jacky <ja...@gmail.com>.
Hi Daniel,
   How do you use a separate database to check for duplicate fields?  That sounds interesting!
 
     Best Regards.
       jacky  
       
----- Original Message ----- 
From: "Daniel Noll" <da...@nuix.com.au>
To: <ja...@lucene.apache.org>
Sent: Friday, September 08, 2006 3:08 PM
Subject: Re: duplicate fields


> jacky wrote:
> > Hi, 1. Is there an efficient way to check whether a document with the
> > same field value (the field holds a unique ID) already exists when
> > adding a document to a Lucene index? Should I run a search on that field?
> 
> One way is to create an IndexReader and IndexSearcher on your index and 
> search for the ID before each add.  You need to reopen them every now and 
> then so that they see recently added documents.  We do this task with a 
> separate database instead, for the sake of efficiency.
> 
> > 2. Is there an efficient way to check whether the Lucene index already
> > contains documents with duplicate values in that field (the unique ID)?
> > Two methods: read all documents and compare the fields, or search for
> > each field value. Is there a better one?
> 
> The simplest way without using an external database is to use the 
> termDocs enumeration.  For each term in the ID field you can easily see 
> whether it has multiple documents; every document other than the first 
> for a term is a duplicate, and you could use that list to build a filter 
> that removes the duplicates.
> 
> Daniel

Re: duplicate fields

Posted by Daniel Noll <da...@nuix.com.au>.
jacky wrote:
> Hi, 1. Is there an efficient way to check whether a document with the
> same field value (the field holds a unique ID) already exists when
> adding a document to a Lucene index? Should I run a search on that field?

One way is to create an IndexReader and IndexSearcher on your index and 
search for the ID before each add.  You need to reopen them every now and 
then so that they see recently added documents.  We do this task with a 
separate database instead, for the sake of efficiency.

> 2. Is there an efficient way to check whether the Lucene index already
> contains documents with duplicate values in that field (the unique ID)?
> Two methods: read all documents and compare the fields, or search for
> each field value. Is there a better one?

The simplest way without using an external database is to use the 
termDocs enumeration.  For each term in the ID field you can easily see 
whether it has multiple documents; every document other than the first 
for a term is a duplicate, and you could use that list to build a filter 
that removes the duplicates.
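
The rule "every document other than the first for each term is a duplicate" can be sketched in plain Java. This is not Lucene code: the map below is an assumed stand-in for what you would collect by walking the ID field's terms and reading the document numbers from the termDocs enumeration for each one. DuplicateIdFinder and findDuplicates are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; the postings map models per-term document lists.
class DuplicateIdFinder {
    // postings maps each ID term to the numbers of the documents that
    // contain it, in increasing order. Returns every document number
    // except the first one listed for each term.
    static List<Integer> findDuplicates(Map<String, List<Integer>> postings) {
        List<Integer> duplicates = new ArrayList<>();
        for (List<Integer> docs : postings.values()) {
            // Keep the first document for the term; the rest are duplicates.
            for (int i = 1; i < docs.size(); i++) {
                duplicates.add(docs.get(i));
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        postings.put("a1", Arrays.asList(0, 3, 7));
        postings.put("b2", Arrays.asList(1));
        postings.put("c3", Arrays.asList(2, 5));
        System.out.println(findDuplicates(postings)); // prints [3, 7, 5]
    }
}
```

The resulting list of document numbers is what you would feed into a filter, or delete from the index, to drop the duplicates.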

Daniel


