Posted to solr-user@lucene.apache.org by "Ali, Saqib" <do...@gmail.com> on 2013/08/21 23:14:46 UTC

removing duplicates

hello,

We have documents that are duplicates, i.e. the ID is different but the rest of
the fields are the same. Is there a query that can remove the duplicates and
leave just one copy of each document in Solr? There is one numeric field that we
can key off to find duplicates.

Please advise.

Thanks

RE: removing duplicates

Posted by "Petersen, Robert" <ro...@mail.rakuten.com>.
Hi

Perhaps you could query for all documents, asking for only the id field to be returned, and facet on the field you say you can key off of for duplicates.  Set facet.mincount to 2 so that only values shared by more than one document come back.  Then, for each facet value returned, filter on that value, page through all of the matching doc IDs, skip the first document, and delete the rest by ID using a small app or something like that.  Send all the deletes to the index and then do a single commit at the end.  I think that would do it.
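
The "keep the first, delete the rest" step can be sketched as a small pure function. This is only an illustration: the name ids_to_delete and the dict-of-lists input shape are assumptions, not anything Solr provides, and it presumes the IDs for each key value have already been collected.

```python
from typing import Dict, List


def ids_to_delete(ids_by_key: Dict[str, List[str]]) -> List[str]:
    """Return every ID except the first one listed for each key value.

    ids_by_key is a hypothetical structure: it maps a value of the
    duplicate-detection field to the document IDs that share that value.
    """
    doomed: List[str] = []
    for key in sorted(ids_by_key):
        ids = ids_by_key[key]
        # ids[0] is the copy that survives; the rest are duplicates.
        doomed.extend(ids[1:])
    return doomed
```

Which copy counts as "first" depends on the order in which the IDs were fetched, so sort the query results if it matters which document is kept.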

Thanks
Robi


RE: removing duplicates

Posted by "Petersen, Robert" <ro...@mail.rakuten.com>.
These are the facet parameters we're talking about:

http://wiki.apache.org/solr/SimpleFacetParameters

Query with something like this:
http://localhost:8983/solr/select?q=*:*&fl=id&rows=0&facet=true&facet.limit=-1&facet.field=<your field name>&facet.mincount=2

Then filter on each facet value returned, using a filter query as described here: http://wiki.apache.org/solr/CommonQueryParameters
Example: q=*:*&fq=<your field name>:<your field value>

Then you would have to get all the IDs returned and delete all but the first one using some app...
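
As a rough sketch, the two queries above could be assembled like this. The field name dup_key is a placeholder, and the wt=json, start, and rows values are assumptions added here for paging and parsing; they are not part of the original suggestion.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # base URL from the message above
FIELD = "dup_key"  # hypothetical field name; substitute your own


def facet_url(field: str = FIELD) -> str:
    """URL that lists every value of `field` shared by two or more documents."""
    params = [
        ("q", "*:*"),
        ("fl", "id"),
        ("rows", "0"),            # only the facet counts are needed here
        ("facet", "true"),
        ("facet.limit", "-1"),    # no cap on the number of facet values
        ("facet.field", field),
        ("facet.mincount", "2"),  # skip values that occur only once
        ("wt", "json"),           # assumption: ask for a JSON response
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def ids_for_value_url(field: str, value: object, start: int = 0, rows: int = 100) -> str:
    """URL for one page of document IDs whose `field` equals `value`."""
    params = [
        ("q", "*:*"),
        ("fq", f"{field}:{value}"),  # fine for numeric values; escape others
        ("fl", "id"),
        ("start", str(start)),
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)
```

Fetching these URLs and paging with start/rows is left to whatever HTTP client the app uses.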

Thanks 
Robi




Re: removing duplicates

Posted by Aloke Ghoshal <al...@gmail.com>.
Hi,

This will help you identify the duplicates:
q=*:*&fl=id&facet=true&facet.mincount=2&rows=0&facet.field=<One_Of_The_Duplicated_Fields>
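
A minimal sketch of reading the result, assuming the query is sent with wt=json and the default flat layout, where each facet field comes back as an alternating [value, count, value, count, ...] list. The function name duplicated_values is made up for illustration.

```python
from typing import Any, Dict, List, Tuple


def duplicated_values(response: Dict[str, Any], field: str) -> List[Tuple[str, int]]:
    """Pair up the flat facet list and keep values that occur at least twice."""
    flat = response["facet_counts"]["facet_fields"][field]
    # Even positions hold the field values, odd positions hold their counts.
    pairs = zip(flat[0::2], flat[1::2])
    return [(str(value), int(count)) for value, count in pairs if int(count) >= 2]
```

With facet.mincount=2 the count check is redundant, but it keeps the function safe if the parameter is left off.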

To actually remove them from Solr, you will have to do something like
Robert suggested. Write an application that uses the results to build a
delete by id query
(http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_documents_by_ID_and_by_Query).
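
For illustration, the delete message described on that wiki page could be built as below. The helper name delete_by_id_xml is an assumption; posting the result to the update handler and sending a commit afterwards is left to the application.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def delete_by_id_xml(ids: Iterable[str]) -> str:
    """Build a <delete> update message with one <id> element per document."""
    root = ET.Element("delete")
    for doc_id in ids:
        # ElementTree escapes characters such as & and < in the ID text.
        ET.SubElement(root, "id").text = str(doc_id)
    return ET.tostring(root, encoding="unicode")
```

Skip the request entirely when there are no IDs to delete, since an empty message has nothing to remove.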

Regards,
Aloke



Re: removing duplicates

Posted by "Ali, Saqib" <do...@gmail.com>.
Thanks Aloke and Robert. Can you please give me code/query snippets?
(newbie here)



Re: removing duplicates

Posted by Aloke Ghoshal <al...@gmail.com>.
Hi,

Facet by one of the duplicate fields (probably by the numeric field that
you mentioned) and set facet.mincount=2.

Regards,
Aloke



Re: removing duplicates

Posted by Liu <li...@duokan.com>.
This picture is extracted from apache-solr-ref-guide-4.4.pdf; maybe it will
help you.
You can download the document from
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/
