Posted to solr-user@lucene.apache.org by 蒋明原 <ma...@gmail.com> on 2012/12/11 17:24:16 UTC

how to remove duplicate data while facet?

Hi all,

I'm running a distributed facet query, and there is duplicate data across the
shards of the cluster.
For example:

Server A holds these documents:

Doc1: uniqueKey=1 userid=a
Doc2: uniqueKey=2 userid=b
Doc3: uniqueKey=3 userid=c

Server B holds these documents:
Doc1: uniqueKey=1 userid=a
Doc2: uniqueKey=4 userid=b
Doc3: uniqueKey=5 userid=c

When I run a facet query on the field "userid", the expected result is:
a:1
b:2
c:2

but Solr gives me:

a:2
b:2
c:2

However, when I run a normal query, userid:a,
Solr gives me a total of 1 result.

It seems that when faceting, documents with a duplicate key still participate
in the count, but for a normal query Solr keeps only one document
from each set of duplicates.
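
The counts above can be reproduced with a small sketch. It assumes that facet
counts are computed on each shard and then simply added together, which is what
the observed result suggests; the dictionaries below are hypothetical in-memory
stand-ins for the two servers, not Solr API calls.

```python
from collections import Counter

# Hypothetical stand-ins for the two servers: uniqueKey -> userid.
shard_a = {"1": "a", "2": "b", "3": "c"}
shard_b = {"1": "a", "4": "b", "5": "c"}
shards = [shard_a, shard_b]


def summed_facet_counts(shards):
    """Count userid values on each shard separately, then add the counts."""
    total = Counter()
    for shard in shards:
        total.update(Counter(shard.values()))
    return dict(total)


def deduplicated_facet_counts(shards):
    """Merge documents by uniqueKey first, then count each key only once."""
    merged = {}
    for shard in shards:
        for key, userid in shard.items():
            merged.setdefault(key, userid)
    return dict(Counter(merged.values()))


print(summed_facet_counts(shards))        # {'a': 2, 'b': 2, 'c': 2}
print(deduplicated_facet_counts(shards))  # {'a': 1, 'b': 2, 'c': 2}
```

The first function matches what Solr returned here; the second matches the
expected result, but it needs every document's uniqueKey in one place, which
per-shard counting does not have.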

So my problem is: how do I remove duplicate documents during a distributed
facet search?

Thanks!

Re: how to remove duplicate data while facet?

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi,

Sounds like you don't need to reindex. You need to find duplicates and
delete them.
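
A minimal sketch of the "find duplicates" step, assuming the uniqueKey values
held by each server have already been exported into plain lists (how they are
exported is not covered in this thread). The function name and the server names
are hypothetical.

```python
from collections import defaultdict


def find_cross_shard_duplicates(shard_keys):
    """Return {uniqueKey: [shard names]} for keys held by more than one shard.

    shard_keys maps a shard name to an iterable of the uniqueKey values it holds.
    """
    locations = defaultdict(list)
    for shard_name, keys in shard_keys.items():
        for key in set(keys):  # ignore repeats within a single shard
            locations[key].append(shard_name)
    return {
        key: sorted(names)
        for key, names in locations.items()
        if len(names) > 1
    }


shard_keys = {
    "serverA": ["1", "2", "3"],
    "serverB": ["1", "4", "5"],
}
duplicates = find_cross_shard_duplicates(shard_keys)
print(duplicates)  # {'1': ['serverA', 'serverB']}
```

For each key reported, one copy would be kept and the others deleted from the
listed servers; which copy to keep is a decision for the index owner.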

Otis
--
SOLR Performance Monitoring - http://sematext.com/spm
On Dec 11, 2012 12:42 PM, "蒋明原" <ma...@gmail.com> wrote:

> Thank you, first of all.
> Yes, having no duplicate unique keys would avoid this trouble.
> But right now I can't reindex my data; it's too big, and it's in a
> production environment.
> So, does anyone have a solution?
>
> Thank you.

Re: how to remove duplicate data while facet?

Posted by 蒋明原 <ma...@gmail.com>.

Thank you, first of all.
Yes, having no duplicate unique keys would avoid this trouble.
But right now I can't reindex my data; it's too big, and it's in a
production environment.
So, does anyone have a solution?

Thank you.

On Wednesday, December 12, 2012, Pawel wrote:

> I think the solution is quite obvious. Make sure that you don't have items
> with the same unique key in multiple shards :)

Re: how to remove duplicate data while facet?

Posted by Pawel <pa...@gmail.com>.

I think the solution is quite obvious. Make sure that you don't have items
with the same unique key in multiple shards :)
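
One way to guarantee this at index time is to choose the target shard from the
unique key itself, so the same key can never land on two servers. The sketch
below is a hypothetical client-side router, not Solr's own routing; the function
names and the use of CRC32 are assumptions made for illustration.

```python
import zlib


def shard_for(unique_key, num_shards):
    """Pick a shard index deterministically from the unique key."""
    return zlib.crc32(unique_key.encode("utf-8")) % num_shards


def route(documents, num_shards):
    """Place each document on the shard chosen by its uniqueKey.

    A repeated uniqueKey always maps to the same shard, where the later
    document overwrites the earlier one instead of creating a second copy.
    """
    shards = [dict() for _ in range(num_shards)]
    for doc in documents:
        key = doc["uniqueKey"]
        shards[shard_for(key, num_shards)][key] = doc
    return shards


documents = [
    {"uniqueKey": "1", "userid": "a"},
    {"uniqueKey": "2", "userid": "b"},
    {"uniqueKey": "3", "userid": "c"},
    {"uniqueKey": "1", "userid": "a"},  # same key as the first document
    {"uniqueKey": "4", "userid": "b"},
    {"uniqueKey": "5", "userid": "c"},
]
shards = route(documents, num_shards=2)
print(sum(len(shard) for shard in shards))  # 5
```

This only helps for documents indexed after the routing is in place; data that
is already duplicated across servers still has to be cleaned up separately.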
