Posted to solr-user@lucene.apache.org by Chris Hostetter <ho...@fucit.org> on 2009/09/02 01:57:05 UTC

Re: Can solr do the equivalent of "select distinct(field)"?

: lets say you filter your query on something and want to know how many
: distinct "categories" that your results comprise.
: then you can facet on the category field and count the number of facet
: values that are returned, right?

if you count the number of facet values returned you are getting a "count 
of distinct values"

if you just want the list of distinct values in a field (for your whole 
index) then TermsComponent is the fastest way.

if you want the list of distinct values across a set of documents, then 
facet on that field when doing your query.

"select distinct category from books where bookInStock='true'" is analogous 
to looking at the facet section of...

   rows=0&q=bookInStock:true&facet=true&facet.field=category


-Hoss
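
A minimal Python sketch of the facet approach described above, not part of the original message. It assumes the response was requested with wt=json and uses Solr's default flat facet layout (value, count, value, count, ...). The base URL, the helper names and the sample response are hypothetical. Values with a count of 0 are skipped, since facet.mincount defaults to 0 and such values do not occur in the matching documents.

```python
from urllib.parse import urlencode


def distinct_query(base_url, query, field):
    """Build a facet-only request URL (rows=0 suppresses the document list)."""
    params = {
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def distinct_values(response, field):
    """Return the facet values that occur in at least one matching document."""
    flat = response["facet_counts"]["facet_fields"][field]
    # The flat layout alternates value, count, value, count, ...
    pairs = zip(flat[0::2], flat[1::2])
    return [value for value, count in pairs if count > 0]


# Hypothetical response body, already decoded from JSON.
sample = {
    "facet_counts": {
        "facet_fields": {"category": ["fiction", 12, "history", 3, "poetry", 0]}
    }
}
url = distinct_query("http://localhost:8983/solr", "bookInStock:true", "category")
values = distinct_values(sample, "category")
print(url)
print(values, len(values))
```

The length of the returned list is the "count of distinct values"; the list itself plays the role of the "select distinct" result.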

Re: Can solr do the equivalent of "select distinct(field)"?

Posted by Aleksander Stensby <al...@integrasco.com>.

Forgot to add facet.mincount=1, obviously. But still, is this the only or
preferred way of doing something along these lines? Or is there a different
(better) approach?

Best regards,
 Aleksander

On Thu, Dec 17, 2009 at 5:59 PM, Aleksander Stensby <
aleksander.stensby@integrasco.com> wrote:

> A follow up question on this Hoss:
> If I have a set of documents, let's say this email thread. Each email has a
> unique author. All emails in the thread are indexed with "threadid=33" If I
> want to count the number of unique authors in this email thread, I could go
> along the lines you mention at the end:
> rows=0&threadid=33&facet=true&facet.field=author&limit=-1
> then count all returned facets. This works, but becomes infeasible when the
> number of unique author values in the index is large. Right?
> So the limit=-1 solution is just not working for such fields. But would
> work well for "category" if the number of unique categories is low...
> It's almost faster to retrieve all entries from the thread and count
> programmatically the number of unique authors... But obviously, I don't want
> to do that!
>
> So, how would you go about finding the number of unique authors in this
> scenario?
>
> Cheers,
>  Aleks

Re: Can solr do the equivalent of "select distinct(field)"?

Posted by Aleksander Stensby <al...@integrasco.com>.

Thanks for your reply Erik!

My suggested query is actually very fast once we add
facet.mincount=1 (when searching within a limited set of documents).
The setback seems to be in the sharding of our data, and that puzzles me a
little bit...

I can't really see why Solr is so slow at doing this.
The scenario:

Let's say we have two servers (s1 and s2).
If I run the following query:
q=threadid:33&facet=true&facet.field=author&limit=-1&facet.mincount=0&rows=0
directly on either server, the response is lightning fast (<10ms).
So, in theory, I could query them directly, concatenate the results myself
and get that done pretty fast.
But if I introduce the shards parameter, the response time balloons to
between 15000ms and 20000ms!
shards=s1:8983/solr,s2:8983/solr
My initial thought is that I MUST be doing something wrong here?

So I tried the following:
Running the query on server s1 with the shards param shards=s1:8983/solr,
the response time goes from sub-10ms to between 5000ms and 10000ms!
Same results if I run the query on s2, and the same if I use shards=s2:8983/solr

Is there really that much overhead in running a distributed facet field
query with Solr? Anyone else experienced this?

On the other hand, running regular distributed queries without faceting is
lightning fast, so I can't really see that this is a network problem or
anything either. It can't possibly be, as I tried running a facet query
on s1 with s1 as the shards param, and that is still as slow as if the
shards param pointed to a different server...

Any insight into this would be greatly appreciated! (I would like to avoid
having to hack together our own solution for concatenating results...)

Cheers,
 Aleks
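
For illustration only, the "query each server directly and concatenate the results myself" workaround mentioned above might look roughly like the following Python sketch. It assumes each shard's author facets arrive in Solr's flat JSON layout (value, count, value, count, ...) and that no document is indexed on more than one shard. The function names and the sample data are hypothetical, and this is not how Solr's own distributed faceting works internally.

```python
from collections import Counter


def flat_to_counts(flat):
    """Convert a flat [value, count, value, count, ...] list to a Counter."""
    return Counter(dict(zip(flat[0::2], flat[1::2])))


def merge_shard_facets(shard_lists):
    """Sum per-shard facet counts and drop values with no matching documents."""
    total = Counter()
    for flat in shard_lists:
        total.update(flat_to_counts(flat))
    return {value: count for value, count in total.items() if count > 0}


# Hypothetical author facets returned by s1 and s2 for threadid:33.
s1 = ["hoss", 2, "aleks", 1, "erik", 0]
s2 = ["aleks", 2, "erik", 1]
merged = merge_shard_facets([s1, s2])
print(merged, len(merged))
```

The number of keys in the merged result is the number of unique authors across both servers; an author who appears on both shards is counted once, with the per-shard counts summed.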


Re: Can solr do the equivalent of "select distinct(field)"?

Posted by Erik Hatcher <er...@gmail.com>.

On Dec 17, 2009, at 11:59 AM, Aleksander Stensby wrote:
> A follow up question on this Hoss:
> If I have a set of documents, let's say this email thread. Each email has a
> unique author. All emails in the thread are indexed with "threadid=33" If I
> want to count the number of unique authors in this email thread, I could go
> along the lines you mention at the end:
> rows=0&threadid=33&facet=true&facet.field=author&limit=-1
> then count all returned facets. This works, but becomes infeasible when the
> number of unique author values in the index is large. Right?
> So the limit=-1 solution is just not working for such fields. But would work
> well for "category" if the number of unique categories is low...
> It's almost faster to retrieve all entries from the thread and count
> programmatically the number of unique authors... But obviously, I don't want
> to do that!
>
> So, how would you go about finding the number of unique authors in this
> scenario?

One possible solution is "tree" faceting: https://issues.apache.org/jira/browse/SOLR-792

     &facet.tree=threadid,author

Could be a LARGE amount of data though!

	Erik

Re: Can solr do the equivalent of "select distinct(field)"?

Posted by Aleksander Stensby <al...@integrasco.com>.

A follow up question on this Hoss:
If I have a set of documents, let's say this email thread. Each email has a
unique author. All emails in the thread are indexed with "threadid=33" If I
want to count the number of unique authors in this email thread, I could go
along the lines you mention at the end:
rows=0&threadid=33&facet=true&facet.field=author&limit=-1
then count all returned facets. This works, but becomes infeasible when the
number of unique author values in the index is large. Right?
So the limit=-1 solution is just not working for such fields. But would work
well for "category" if the number of unique categories is low...
It's almost faster to retrieve all entries from the thread and count
programmatically the number of unique authors... But obviously, I don't want
to do that!

So, how would you go about finding the number of unique authors in this
scenario?

Cheers,
 Aleks
