You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Mark juszczec <ma...@gmail.com> on 2011/09/10 02:08:37 UTC

solr equivalent of "select distinct"

Hello everyone

Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3
.... FLD100

PK is my solr primary key and I'm creating it by concatenating
FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique

Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure about
the remaining fields

Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both
records.  I only want 1.

Research says I should use faceting.  But this:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
facet=true & facet_field=FLD1 & facet_field=FLD2

gives me 2 records.

In fact, it gives me the same results as:

q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2

I'm wrong somewhere, but I'm unsure where.

Is faceting the right way to go or should I be using grouping?

Curiously, when I use grouping like this:

q=FLD1:A and FLD2:B &rows=500 &defType=edismax &indent=true &fl=FLD1, FLD2
&group=true &group.field=FLD1 &group.field=FLD2

I get 2 records as well.

Has anyone dealt with mimicing "select distinct" in Solr?

Any advice would be very appreciated.

Mark

Re: solr equivalent of "select distinct"

Posted by lee carroll <le...@googlemail.com>.
if you have a limited set of searches which need to use this and they
act on a limited known set of fields you can concat fields at index
time and then facet

PK   FLD1      FLD2    FLD3 FLD4 FLD5 copy45
AB0  A            B          0     x       y        x y
AB1  A            B          1     x       y        x y
CD0  C            D          0     a       b        a b
CD1  C            D          1     e       f         e f

faceting on copy45 field would give you the correct "distinct" term
values (plus their counts).
Its pretty contrived and limited to knowing the fields you need to concat.

What is the use case for this ? it maybe another approach would fit better

lee c

On 11 September 2011 22:26, Michael Sokolov <so...@ifactory.com> wrote:
> You can get what you want - unique lists of values from docs matching your
> query - for a single field (using facets), but not for the co-occurrence of
> two field values.  So you could combine the two fields together, if you know
> what they are going to be "in advance."  Facets also give you counts, so in
> some special cases, you could get what you want - eg you can tell when there
> is only a single pair of values since their counts will be the same and the
> same as the total.  But that's all I can think of.
>
> -Mike
>
> On 9/11/2011 12:39 PM, Mark juszczec wrote:
>>
>> Here's an example:
>>
>> PK   FLD1      FLD2    FLD3 FLD4 FLD5
>> AB0  A            B          0     x       y
>> AB1  A            B          1     x       y
>> CD0  C            D          0     a       b
>> CD1  C            D          1     e       f
>>
>> I want to write a query using only the terms FLD1 and FLD2 and ONLY get
>> back:
>>
>> A B x y
>> C D a b
>> C D e f
>>
>> Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
>> occurrence of those records.
>>
>> Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
>> occurrences of those records.
>>
>
>

Re: solr equivalent of "select distinct"

Posted by Michael Sokolov <so...@ifactory.com>.
You can get what you want - unique lists of values from docs matching 
your query - for a single field (using facets), but not for the 
co-occurrence of two field values.  So you could combine the two fields 
together, if you know what they are going to be "in advance."  Facets 
also give you counts, so in some special cases, you could get what you 
want - eg you can tell when there is only a single pair of values since 
their counts will be the same and the same as the total.  But that's all 
I can think of.

-Mike

On 9/11/2011 12:39 PM, Mark juszczec wrote:
> Here's an example:
>
> PK   FLD1      FLD2    FLD3 FLD4 FLD5
> AB0  A            B          0     x       y
> AB1  A            B          1     x       y
> CD0  C            D          0     a       b
> CD1  C            D          1     e       f
>
> I want to write a query using only the terms FLD1 and FLD2 and ONLY get
> back:
>
> A B x y
> C D a b
> C D e f
>
> Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
> occurrence of those records.
>
> Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
> occurrences of those records.
>


Re: solr equivalent of "select distinct"

Posted by Erick Erickson <er...@gmail.com>.
Hmmm, there's no good way I can think of off the top of my
head to do this. Whenever people find themselves thinking
in terms of RDBMSs, I have to ask whether the problem is
really appropriate for a search engine. And/or what the problem
you're trying to solve with this approach is from a higher level.
Perhaps there's another approach completely that would
serve....

Best
Erick

On Sun, Sep 11, 2011 at 12:39 PM, Mark juszczec <ma...@gmail.com> wrote:
> Erick
>
> Thanks very much for the reply.
>
> I typed this late Friday after work and tried to simplify the problem
> description.  I got something wrong.  Hopefully this restatement is better:
>
> My PK is FLD1, FLD2 and FLD3 concatenated together.
>
> In some cases FLD1 and FLD2 can be the same.  The ONLY differing field being
> FLD3.
>
> Here's an example:
>
> PK   FLD1      FLD2    FLD3 FLD4 FLD5
> AB0  A            B          0     x       y
> AB1  A            B          1     x       y
> CD0  C            D          0     a       b
> CD1  C            D          1     e       f
>
> I want to write a query using only the terms FLD1 and FLD2 and ONLY get
> back:
>
> A B x y
> C D a b
> C D e f
>
> Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
> occurrence of those records.
>
> Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
> occurrences of those records.
>
> I'm hoping I can use wildcards to get FLD4 and FLD5.  If not, I can use fl=
>
> I'm using edismax.
>
> We are also creating the query string on the fly.  I suspect using SolrJ and
> plugging the values into a bean would be easier - or do I have that wrong?
>
> I hope the tables of example data display properly.
>
> Mark
>
> On Sun, Sep 11, 2011 at 12:06 PM, Erick Erickson <er...@gmail.com>wrote:
>
>> This smells like an XY problem, can you back up and give a higher-level
>> reason *why* you want this behavior?
>>
>> Because given your problem description, this seems like you are getting
>> correct behavior no matter how you define the problem. You're essentially
>> saying that you have two records with identical beginnings of your PK,
>> why is it incorrect to give you both records?
>>
>> But, anyway, if you're searching on FLD1 and FLD2, then by definition
>> you're going to get both records back or the search would be failing!
>>
>> Best
>> Erick
>>
>> On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec <ma...@gmail.com>
>> wrote:
>> > Hello everyone
>> >
>> > Let's say each record in my index contains fields named PK, FLD1, FLD2,
>> FLD3
>> > .... FLD100
>> >
>> > PK is my solr primary key and I'm creating it by concatenating
>> > FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique
>> >
>> > Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure
>> about
>> > the remaining fields
>> >
>> > Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get
>> both
>> > records.  I only want 1.
>> >
>> > Research says I should use faceting.  But this:
>> >
>> > q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
>> > facet=true & facet_field=FLD1 & facet_field=FLD2
>> >
>> > gives me 2 records.
>> >
>> > In fact, it gives me the same results as:
>> >
>> > q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2
>> >
>> > I'm wrong somewhere, but I'm unsure where.
>> >
>> > Is faceting the right way to go or should I be using grouping?
>> >
>> > Curiously, when I use grouping like this:
>> >
>> > q=FLD1:A and FLD2:B &rows=500 &defType=edismax &indent=true &fl=FLD1,
>> FLD2
>> > &group=true &group.field=FLD1 &group.field=FLD2
>> >
>> > I get 2 records as well.
>> >
>> > Has anyone dealt with mimicing "select distinct" in Solr?
>> >
>> > Any advice would be very appreciated.
>> >
>> > Mark
>> >
>>
>

Re: solr equivalent of "select distinct"

Posted by Mark juszczec <ma...@gmail.com>.
Erick

Thanks very much for the reply.

I typed this late Friday after work and tried to simplify the problem
description.  I got something wrong.  Hopefully this restatement is better:

My PK is FLD1, FLD2 and FLD3 concatenated together.

In some cases FLD1 and FLD2 can be the same.  The ONLY differing field being
FLD3.

Here's an example:

PK   FLD1      FLD2    FLD3 FLD4 FLD5
AB0  A            B          0     x       y
AB1  A            B          1     x       y
CD0  C            D          0     a       b
CD1  C            D          1     e       f

I want to write a query using only the terms FLD1 and FLD2 and ONLY get
back:

A B x y
C D a b
C D e f

Since FLD4 and FLD5 are the same for PK=AB0 and AB1, I only want one
occurrence of those records.

Since FLD4 and FLD5 are different for PK=CD0 and CD1, I want BOTH
occurrences of those records.

I'm hoping I can use wildcards to get FLD4 and FLD5.  If not, I can use fl=

I'm using edismax.

We are also creating the query string on the fly.  I suspect using SolrJ and
plugging the values into a bean would be easier - or do I have that wrong?

I hope the tables of example data display properly.

Mark

On Sun, Sep 11, 2011 at 12:06 PM, Erick Erickson <er...@gmail.com>wrote:

> This smells like an XY problem, can you back up and give a higher-level
> reason *why* you want this behavior?
>
> Because given your problem description, this seems like you are getting
> correct behavior no matter how you define the problem. You're essentially
> saying that you have two records with identical beginnings of your PK,
> why is it incorrect to give you both records?
>
> But, anyway, if you're searching on FLD1 and FLD2, then by definition
> you're going to get both records back or the search would be failing!
>
> Best
> Erick
>
> On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec <ma...@gmail.com>
> wrote:
> > Hello everyone
> >
> > Let's say each record in my index contains fields named PK, FLD1, FLD2,
> FLD3
> > .... FLD100
> >
> > PK is my solr primary key and I'm creating it by concatenating
> > FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique
> >
> > Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure
> about
> > the remaining fields
> >
> > Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get
> both
> > records.  I only want 1.
> >
> > Research says I should use faceting.  But this:
> >
> > q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
> > facet=true & facet_field=FLD1 & facet_field=FLD2
> >
> > gives me 2 records.
> >
> > In fact, it gives me the same results as:
> >
> > q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2
> >
> > I'm wrong somewhere, but I'm unsure where.
> >
> > Is faceting the right way to go or should I be using grouping?
> >
> > Curiously, when I use grouping like this:
> >
> > q=FLD1:A and FLD2:B &rows=500 &defType=edismax &indent=true &fl=FLD1,
> FLD2
> > &group=true &group.field=FLD1 &group.field=FLD2
> >
> > I get 2 records as well.
> >
> > Has anyone dealt with mimicing "select distinct" in Solr?
> >
> > Any advice would be very appreciated.
> >
> > Mark
> >
>

Re: solr equivalent of "select distinct"

Posted by Erick Erickson <er...@gmail.com>.
This smells like an XY problem, can you back up and give a higher-level
reason *why* you want this behavior?

Because given your problem description, this seems like you are getting
correct behavior no matter how you define the problem. You're essentially
saying that you have two records with identical beginnings of your PK,
why is it incorrect to give you both records?

But, anyway, if you're searching on FLD1 and FLD2, then by definition
you're going to get both records back or the search would be failing!

Best
Erick

On Fri, Sep 9, 2011 at 8:08 PM, Mark juszczec <ma...@gmail.com> wrote:
> Hello everyone
>
> Let's say each record in my index contains fields named PK, FLD1, FLD2, FLD3
> .... FLD100
>
> PK is my solr primary key and I'm creating it by concatenating
> FLD1+FLD2+FLD3 and I'm guaranteed that combination will be unique
>
> Let's say 2 of these records have FLD1 = A and FLD2 = B.  I am unsure about
> the remaining fields
>
> Right now, if I do a query specifying FLD1 = A and FLD2 = B then I get both
> records.  I only want 1.
>
> Research says I should use faceting.  But this:
>
> q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2 &
> facet=true & facet_field=FLD1 & facet_field=FLD2
>
> gives me 2 records.
>
> In fact, it gives me the same results as:
>
> q=FLD1:A and FLD2:B & rows=500 & defType=edismax & fl=FLD1, FLD2
>
> I'm wrong somewhere, but I'm unsure where.
>
> Is faceting the right way to go or should I be using grouping?
>
> Curiously, when I use grouping like this:
>
> q=FLD1:A and FLD2:B &rows=500 &defType=edismax &indent=true &fl=FLD1, FLD2
> &group=true &group.field=FLD1 &group.field=FLD2
>
> I get 2 records as well.
>
> Has anyone dealt with mimicing "select distinct" in Solr?
>
> Any advice would be very appreciated.
>
> Mark
>