Posted to solr-user@lucene.apache.org by KaktuChakarabati <ji...@gmail.com> on 2009/11/24 23:29:00 UTC

Deduplication in 1.4

Hey,
I've been trying to find some documentation on using this feature in 1.4, but
the wiki page is a little sparse.
Specifically, here's what I'm trying to do:

I have a field, say 'duplicate_group_id', that I'll populate based on an
offline document deduplication process I have.

All I want is for Solr to compute a 'duplicate_signature' field based on
this one at update time, so that when I search for documents later, all
documents with the same original 'duplicate_group_id' value will be rolled up
(i.e. I'll just get the first one that came back according to relevancy).
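
To make the desired roll-up concrete, here is a minimal Python sketch that
does it client-side on an already ranked result list. It is only an
illustration of the behaviour being asked for: the function name
collapse_by_field and the sample documents are made up, and this is not how
Solr itself implements anything.

```python
def collapse_by_field(ranked_docs, field):
    """Keep only the first (highest-ranked) document for each value of `field`.

    `ranked_docs` is assumed to be sorted by relevancy already.
    Documents that lack the field are kept as-is and never merged.
    """
    seen = set()
    collapsed = []
    for doc in ranked_docs:
        key = doc.get(field)
        if key is None:
            collapsed.append(doc)
        elif key not in seen:
            seen.add(key)
            collapsed.append(doc)
    return collapsed


# Hypothetical search results, best match first.
results = [
    {"id": "a", "duplicate_group_id": "g1", "score": 3.2},
    {"id": "b", "duplicate_group_id": "g2", "score": 2.9},
    {"id": "c", "duplicate_group_id": "g1", "score": 2.5},
    {"id": "d", "duplicate_group_id": "g3", "score": 1.1},
]

top = collapse_by_field(results, "duplicate_group_id")
print([d["id"] for d in top])  # ['a', 'b', 'd']
```

Document "c" is dropped because "a" shares its group and ranks higher.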

I enabled the deduplication processor and put it into the updater, but I'm not
seeing any difference in the returned results (i.e. results with the same
duplicate_id are returned separately).

Is there anything I need to supply at query time for this to take effect?
What should the behaviour be? Is there a working example of this?

Any pointers would be helpful.

Thanks,
Chak
-- 
View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Deduplication in 1.4

Posted by Martijn v Groningen <ma...@gmail.com>.

Two sites that use field-collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean by double-tripping. The sites mentioned
do not have performance problems caused by field collapsing.

Field collapsing currently only supports quasi-distributed
field collapsing (as I have described on the Solr wiki). I currently
don't know of a distributed field-collapsing algorithm that works
properly and does not increase search time to the point that searches
become slow.

Martijn

2009/11/26 Otis Gospodnetic <ot...@yahoo.com>:
> Hi Martijn,
>
>
> ----- Original Message ----
>
>> From: Martijn v Groningen <ma...@gmail.com>
>> To: solr-user@lucene.apache.org
>> Sent: Thu, November 26, 2009 3:19:40 AM
>> Subject: Re: Deduplication in 1.4
>>
>> Field collapsing has been used by many in their production
>> environment.
>
> Got any pointers to public sites you know use it?  I know of a high-traffic site that used an early version, and it caused performance problems.  Is double-tripping still required?
>
>> Over the last few months the stability of the patch has improved, as
>> quite a few bugs were fixed. The only big feature currently missing is
>> caching of the collapsing algorithm. I'm currently working on that and
>
> Is it also fully distributed-search-ready?
>
>> I will put it in a new patch in the next few days. So yes, the
>> patch is very close to being production-ready.
>
> Thanks,
> Otis
>
>> Martijn

Re: Deduplication in 1.4

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi Martijn,

 
----- Original Message ----

> From: Martijn v Groningen <ma...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Thu, November 26, 2009 3:19:40 AM
> Subject: Re: Deduplication in 1.4
> 
> Field collapsing has been used by many in their production
> environment. 

Got any pointers to public sites you know use it?  I know of a high-traffic site that used an early version, and it caused performance problems.  Is double-tripping still required?

> Over the last few months the stability of the patch has improved, as
> quite a few bugs were fixed. The only big feature currently missing is
> caching of the collapsing algorithm. I'm currently working on that and

Is it also fully distributed-search-ready?

> I will put it in a new patch in the next few days. So yes, the
> patch is very close to being production-ready.

Thanks,
Otis

Re: Deduplication in 1.4

Posted by Martijn v Groningen <ma...@gmail.com>.

Field collapsing has been used by many in their production
environments. Over the last few months the stability of the patch has
improved, as quite a few bugs were fixed. The only big feature currently
missing is caching of the collapsing algorithm. I'm currently working on
that and I will put it in a new patch in the next few days. So yes, the
patch is very close to being production-ready.

Martijn

2009/11/26 KaktuChakarabati <ji...@gmail.com>:
>
> Hey Otis,
> Yep, I realized this myself after playing some with the dedupe feature
> yesterday.
> So it does look like Field collapsing is what I need pretty much.
> Any idea on how close it is to being production-ready?
>
> Thanks,
> -Chak

Re: Deduplication in 1.4

Posted by KaktuChakarabati <ji...@gmail.com>.

Hey Otis,
Yep, I realized this myself after playing some with the dedupe feature
yesterday.
So it does look like Field collapsing is what I need pretty much.
Any idea on how close it is to being production-ready?

Thanks,
-Chak

Otis Gospodnetic wrote:
> 
> Hi,
> 
> As far as I know, the point of deduplication in Solr (
> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
> document before indexing it in order to avoid duplicates in the index in
> the first place.
> 
> What you are describing is closer to the field collapsing patch in SOLR-236.
> 
>  Otis

-- 
View this message in context: http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Deduplication in 1.4

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,

As far as I know, the point of deduplication in Solr ( http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document before indexing it in order to avoid duplicates in the index in the first place.
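
As a rough illustration of that index-time behaviour, here is a toy Python
sketch. It assumes an overwrite-on-duplicate policy and an MD5 hash over the
chosen fields; the ToyIndex class and the signature helper are hypothetical
names for this example only and are not Solr's actual implementation.

```python
import hashlib


def signature(doc, fields):
    """Hash the values of the chosen fields into a hex digest."""
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


class ToyIndex:
    """In-memory stand-in for an index that dedupes at update time."""

    def __init__(self, signature_fields):
        self.signature_fields = signature_fields
        self.docs_by_signature = {}

    def add(self, doc):
        sig = signature(doc, self.signature_fields)
        # Assumed policy: a document with an existing signature
        # replaces the one that was indexed earlier.
        self.docs_by_signature[sig] = dict(doc, duplicate_signature=sig)
        return sig

    def all_docs(self):
        return list(self.docs_by_signature.values())


index = ToyIndex(["duplicate_group_id"])
index.add({"id": "a", "duplicate_group_id": "g1"})
index.add({"id": "b", "duplicate_group_id": "g2"})
index.add({"id": "c", "duplicate_group_id": "g1"})  # same signature as "a"

print(sorted(d["id"] for d in index.all_docs()))  # ['b', 'c']
```

In this sketch the duplicate is dropped when documents are added, and nothing
happens at query time. If duplicates are kept rather than overwritten, the
signature is just another stored field and search results are unaffected.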

What you are describing is closer to the field collapsing patch in SOLR-236.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


