Posted to user@lucenenet.apache.org by Patrick Burrows <pb...@gmail.com> on 2007/05/18 01:30:35 UTC

Handling Duplicates

Is there anything built into DotLucene that will handle duplicate entries?
While I am going to do my best to make sure there are no duplicates, there
may be a case where I am unsure if a transaction completed (such as a disk
full situation, maybe) and I might want to re-add a certain number of items
just to make sure they are in there.

If there isn't something built in, is there a strategy others have used to
handle this?

Thanks.

Re: Handling Duplicates

Posted by Michael Garski <mg...@mac.com>.

Patrick,

I've had to do something very similar, and you have a couple of options:

1. If the 'popularity' value is stored in a database, you can look up  
those values after performing your search against the index and then  
sort.

2. Continually update the index to reflect the most recent  
'popularity' value and then perform a custom sort during your search.

For my application, #2 is what we found to be most efficient.

Michael
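
Option 1 above can be sketched roughly as follows. This is a minimal Python illustration, not DotLucene code: `hits` is a hypothetical stand-in for the (id, relevance score) pairs a search would return, and `popularity_db` is a hypothetical stand-in for the database lookup. It assumes every indexed document carries an id that is also the database key.

```python
# Hypothetical stand-ins: `hits` mimics (doc_id, relevance_score) pairs from a
# search, and `popularity_db` mimics a database table keyed by the same id.

def rerank_by_popularity(hits, popularity_db):
    """Sort hits by externally stored popularity, breaking ties by relevance."""
    return sorted(
        hits,
        # Documents missing from the database are treated as popularity 0.
        key=lambda hit: (popularity_db.get(hit[0], 0), hit[1]),
        reverse=True,
    )


hits = [("doc-1", 0.92), ("doc-2", 0.75), ("doc-3", 0.40)]
popularity_db = {"doc-1": 10, "doc-2": 250, "doc-3": 250}

ranked = rerank_by_popularity(hits, popularity_db)
print(ranked)
```

Because the popularity values are read at query time, an hourly change needs no reindexing; the trade-off is one extra lookup per result page.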


On May 18, 2007, at 4:48 AM, Patrick Burrows wrote:

> Thanks guys. I'll try it out.
>
> My next question is going to be about ranking the results of my searches
> based on information that is not in the index (popularity, for instance,
> which might change hourly). Is there some reading I can do on the subject
> before I start asking questions?

Re: Handling Duplicates

Posted by Patrick Burrows <pb...@gmail.com>.

Thanks guys. I'll try it out.

My next question is going to be about ranking the results of my searches
based on information that is not in the index (popularity, for instance,
which might change hourly). Is there some reading I can do on the subject
before I start asking questions?


-- 
-
P

Re: Handling Duplicates

Posted by Dave <da...@yahoo.com>.

Patrick,

I also have to deal with the issue of duplicate
entries and I've found the use of the
IndexReader.TermDocs(Term) function invaluable toward
this goal. It was a far easier function to use than
IndexSearcher.Search, which in my experience has
sometimes returned unexpected results, though I blame
the Analyzer I was using more than the IndexSearcher
itself. Of course, as Michael stated, this is only a
viable option if your documents have a field with a
unique id.

Sincerely,
Dave
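
The idea of looking up the exact unique-id term, rather than running an analyzed search, can be sketched like this. It is a toy in-memory model in Python, not the DotLucene API: `ToyIndex`, its `term_docs` method and the `readd` helper are all hypothetical names, and the `id` field is assumed to hold a unique value per document.

```python
from collections import defaultdict


class ToyIndex:
    """In-memory stand-in for an index; NOT the DotLucene API."""

    def __init__(self):
        self.docs = {}                     # internal doc number -> stored fields
        self.postings = defaultdict(set)   # (field, exact value) -> doc numbers
        self.next_doc = 0

    def add(self, fields):
        doc = self.next_doc
        self.next_doc += 1
        self.docs[doc] = fields
        for field, value in fields.items():
            self.postings[(field, value)].add(doc)
        return doc

    def term_docs(self, field, value):
        """Return doc numbers holding the exact term; no analysis is applied."""
        return sorted(self.postings.get((field, value), ()))

    def delete(self, doc):
        fields = self.docs.pop(doc)
        for field, value in fields.items():
            self.postings[(field, value)].discard(doc)


def readd(index, fields, id_field="id"):
    """Remove any existing copies of a document, then add it exactly once."""
    for doc in index.term_docs(id_field, fields[id_field]):
        index.delete(doc)
    return index.add(fields)


index = ToyIndex()
index.add({"id": "42", "title": "first copy"})
index.add({"id": "42", "title": "accidental duplicate"})
readd(index, {"id": "42", "title": "final copy"})
print(index.term_docs("id", "42"))
```

Deleting by exact term and then adding makes the re-add safe to repeat, which suits the case where it is unclear whether an earlier add completed.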

Re: Handling Duplicates

Posted by Michael Garski <mg...@mac.com>.

Patrick,

There is nothing built into Lucene to ensure that a given document is
only in the index once; it is up to your application to provide that
logic.  If I need to ensure that I do not add a document to the index 
that is already present, I first perform a search against a field that 
is used to denote a unique document, such as an integer id.

Michael
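
A minimal sketch of that check-before-add logic, in Python rather than DotLucene code. The dict `index` is a hypothetical stand-in for the real index, and the membership test stands in for the search against the unique-id field; the field name `id` is an assumption.

```python
def add_if_absent(index, doc, id_field="id"):
    """Add `doc` only if no indexed document shares its unique id."""
    doc_id = doc[id_field]
    if doc_id in index:          # stands in for searching the unique-id field
        return False             # already present, so skip the add
    index[doc_id] = doc
    return True


index = {}
batch = [
    {"id": 1, "title": "alpha"},
    {"id": 2, "title": "beta"},
    {"id": 1, "title": "alpha"},  # re-added after an uncertain write
]
added = [add_if_absent(index, doc) for doc in batch]
print(added)
```

The check and the add are separate steps, so this sketch assumes a single writer; with several writers the two steps would need to be serialized.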
