
Posted to solr-user@lucene.apache.org by Ninad Raut <hb...@gmail.com> on 2010/07/16 10:07:53 UTC

Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Hi,

I have a scenario in which I have to find the count of distinct unique IDs
present in a field (rootId field in my case) for a particular query.

I need this for pagination.

Is there a way in Solr to do something like what we do in SQL:

select count(distinct(rootId))
from table
where (the query part).


Regards,
Ninad R

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Ninad Raut <hb...@gmail.com>.

Hi,

Also, the collapsing feature doesn't give the number of records actually
returned (grouped by a field value). It gives the number of hits for the
query, which is not useful for pagination.

Is there a way, at least with collapsing, to get the count of the records
actually returned rather than the hit count?

On Mon, Jul 19, 2010 at 7:32 PM, kenf_nc <ke...@realestate.com> wrote:

>
> Oh, okay. Got it now. Unfortunately I don't believe Solr supplies a total
> count of matching facet values. One way to do this, although performance
> may
> suffer, is to set your limit to -1 and just get back everything, that will
> give you the count. You may want to set mincount to 1 so you aren't
> counting
> facet values that aren't in your query, but that really depends on your
> need.
>
> ...&facet.limit=-1&facet.mincount=1
>
> adding that to any facet query will return all matching facet values.
> Depending on how many unique values you have, this could be a lot. But it
> will give you what you are looking for. Unless your data changes
> frequently,
> maybe you can call it once and cache the results for some period of time.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Finding-distinct-unique-IDs-in-documents-returned-by-fq-Urgent-Help-Req-tp971883p978548.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by kenf_nc <ke...@realestate.com>.

Oh, okay. Got it now. Unfortunately I don't believe Solr supplies a total
count of matching facet values. One way to do this, although performance may
suffer, is to set facet.limit to -1 and get back everything; counting the
returned values gives you the total. You may want to set facet.mincount to 1
so you aren't counting facet values that aren't in your query results, but
that really depends on your need.

...&facet.limit=-1&facet.mincount=1

Adding that to any facet query will return all matching facet values.
Depending on how many unique values you have, this could be a lot, but it
will give you what you are looking for. Unless your data changes frequently,
maybe you can call it once and cache the results for some period of time.
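
A minimal sketch of the approach above, in Python. The endpoint URL, the
rows=0 and wt=json parameters, and the sample response are assumptions made
for illustration; only facet.limit=-1 and facet.mincount=1 come from the
suggestion itself. It assumes Solr's default JSON facet layout, a flat list
alternating value and count.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port, and core to your installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def distinct_count_url(query, field):
    """Build a request that returns every facet value with at least one hit."""
    params = {
        "q": query,
        "rows": 0,            # assumption: we only want the facet data
        "wt": "json",         # assumption: JSON response writer
        "facet": "on",
        "facet.field": field,
        "facet.limit": -1,    # return all facet values
        "facet.mincount": 1,  # skip values absent from the result set
    }
    return SOLR_SELECT + "?" + urlencode(params)


def count_facet_values(response_text, field):
    """Count distinct values in a flat [value, count, value, count, ...] list."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    return len(flat) // 2


# Made-up response body standing in for what the server would return.
sample_response = json.dumps({
    "response": {"numFound": 42, "docs": []},
    "facet_counts": {"facet_fields": {"rootId": ["r1", 20, "r2", 15, "r3", 7]}},
})

print(distinct_count_url("content:foo", "rootId"))
print(count_facet_values(sample_response, "rootId"))
```

The count is computed on the client, so the whole facet list still travels
over the wire; that is where the performance cost mentioned above comes from.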
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Finding-distinct-unique-IDs-in-documents-returned-by-fq-Urgent-Help-Req-tp971883p978548.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Chris Hostetter <ho...@fucit.org>.

: > being returned (consider the case where we are sorting in term order - once
: > we have collected counts for ${facet.limit} constraints, we can stop
: > iterating over terms -- but to compute the total number of constraints (ie:
: > terms) we would have to keep going and test every one of them against
: > ${facet.mincount})
: >   
: I've been told this before, but it still doesn't really make sense to me.  How
: can you possibly find the top N constraints, without having at least examined
: all the constraints?  How do you know which are the top N if there are some you

that's exactly my point: in the scenario where you've asked for
facet.mincount=N&facet.limit=M&facet.sort=index you don't have to find the
"top" constraints, you just have to find the first M terms in index order
that have a count of at least N.
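
A toy model of that difference, in Python. This is not Solr's code; the
function names and the sample term list are made up for illustration. It
only shows why index-order faceting can stop early while a total count
cannot.

```python
def first_m_in_index_order(term_counts, mincount, limit):
    """Collect the first `limit` terms whose count is at least `mincount`.

    `term_counts` is a list of (term, count) pairs already in index order.
    Returns the picked pairs and how many terms had to be examined.
    """
    picked = []
    examined = 0
    for term, count in term_counts:
        examined += 1
        if count >= mincount:
            picked.append((term, count))
            if len(picked) == limit:
                break  # early exit: later terms are never looked at
    return picked, examined


def total_matching(term_counts, mincount):
    """Count every term that passes mincount; this must visit all terms."""
    return sum(1 for _, count in term_counts if count >= mincount)


# Hypothetical per-term counts for one query, in index (sorted) order.
terms = [("a", 3), ("b", 0), ("c", 5), ("d", 1), ("e", 2), ("f", 0)]

picked, examined = first_m_in_index_order(terms, mincount=1, limit=2)
print(picked, examined)          # the first two qualifying terms
print(total_matching(terms, 1))  # needs a pass over all six terms
```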

: But I may be missing something. I've examined only one of the code
: paths/methods for faceting in source code, the one (if my reading was correct)
: that ends up used for high-cardinality multi-valued fields -- in that method,
: it looked like it should add no work at all to give you a facet unique value
: (result set value cardinality) count. (with facet.mincount of 1 anyway).  But
: I may have been mis-reading, or it may be that other methods are more
: troublesome.

in any case where you are sorting by *counts* then yes, all of the
constraints have to be checked, so you can count them as you go -- but
that doesn't scale in distributed faceting, you can't just add the counts
up from each shard because you don't know what the overlap is -- hence my
comment about how to dedup them.

there are some simple use cases where it's feasible, but in general it's a
very hard problem.
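
A small Python illustration of the overlap problem. The shard contents are
invented, and real distributed faceting exchanges far more than plain sets;
the point is only that per-shard distinct counts cannot be summed.

```python
def distinct_across_shards(shard_values):
    """Union the per-shard value sets, then count; duplicates collapse."""
    merged = set()
    for values in shard_values:
        merged |= values
    return len(merged)


# Hypothetical facet values seen on two shards; r2 and r3 occur on both.
shard_a = {"r1", "r2", "r3"}
shard_b = {"r2", "r3", "r4"}

naive_total = len(shard_a) + len(shard_b)  # double-counts the overlap
true_total = distinct_across_shards([shard_a, shard_b])

print(naive_total, true_total)
```

Getting the true total means shipping every value (not just a count) from
each shard to the coordinator, which is the expensive part.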


-Hoss

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Chris Hostetter wrote:
> computing the number:  in some algorithms it's relatively cheap (on a
> single server) but in others it's more expensive than computing the facet
> counts being returned (consider the case where we are sorting in term
> order - once we have collected counts for ${facet.limit} constraints, we
> can stop iterating over terms -- but to compute the total number of
> constraints (ie: terms) we would have to keep going and test every one of
> them against ${facet.mincount})
>   
I've been told this before, but it still doesn't really make sense to
me.  How can you possibly find the top N constraints, without having at
least examined all the constraints?  How do you know which are the top N
if there are some you haven't looked at? And if you've looked at them
all, it's no problem to increment a counter as you look at each one.
Although I guess the facet.mincount test does possibly put a crimp in
things, I don't ever set that param to anything other than 1 myself,
so I hadn't considered it.

But I may be missing something. I've examined only one of the code 
paths/methods for faceting in source code, the one (if my reading was 
correct) that ends up used for high-cardinality multi-valued fields -- 
in that method, it looked like it should add no work at all to give you 
a facet unique value (result set value cardinality) count. (with 
facet.mincount of 1 anyway).  But I may have been mis-reading, or it may 
be that other methods are more troublesome.

At any rate, if I need it badly enough, I'll try to write my own facet
component that does it (perhaps a subclass of the existing SimpleFacets),
and see what happens.  It does seem to be something a variety of
people's use cases could use; I see it mentioned periodically in the
listserv archives.

Jonathan

RE: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Chris Hostetter <ho...@fucit.org>.

: > I would like to get the total count of the facet.field response values
: 
: I'm pretty sure there's no way to get Solr to do that -- other than not 
: setting a facet.limit, getting every value back in the response, and 
: counting them yourself (not feasible for very large counts).  I've 
: looked at trying to patch Solr to do it, because I could really use it 
: too; it's definitely possible, but made trickier because there are now 
: several different methods that Solr can use to do faceting, with
: separate code paths.  It seems like an odd omission to me too.

beyond just having multiple facet algorithms for performance making it
difficult to add this feature, the other issue is the performance of
computing the number:  in some algorithms it's relatively cheap (on a
single server) but in others it's more expensive than computing the facet
counts being returned (consider the case where we are sorting in term
order - once we have collected counts for ${facet.limit} constraints, we
can stop iterating over terms -- but to compute the total number of
constraints (ie: terms) we would have to keep going and test every one of
them against ${facet.mincount})

With distributed searching it becomes even more prohibitive -- your
description of using an infinite facet.limit and asking for every value
back to count them is exactly what would have to be done internally in a
distributed faceting situation -- except they couldn't just be counted,
they'd have to be deduped and then counted.

To do this efficiently, other data structures (denormalized beyond just 
the inverted index level) would need to be built.

-Hoss

RE: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Jonathan Rochkind <ro...@jhu.edu>.

> I would like to get the total count of the facet.field response values

I'm pretty sure there's no way to get Solr to do that -- other than not setting a facet.limit, getting every value back in the response, and counting them yourself (not feasible for very large counts). I've looked at trying to patch Solr to do it, because I could really use it too; it's definitely possible, but made trickier because there are now several different methods that Solr can use to do faceting, with separate code paths. It seems like an odd omission to me too.

Jonathan

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Ninad Raut <hb...@gmail.com>.

Hi,

I would like to get the total count of the facet.field response values:

i.e., if my response is

<lst name="manu">
  <int name="Canon USA">17</int>
  <int name="Olympus">12</int>
  <int name="Sony">12</int>
  <int name="Panasonic">9</int>
  <int name="Nikon">4</int>
</lst>


I would like the count of unique names found to be reported as 5 ("Canon
USA"+"Olympus"+"Sony"+"Panasonic"+"Nikon")
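
For illustration, a short Python sketch that derives that number on the
client from the facet fragment above. The function name is made up, and it
assumes the whole facet list has been returned (i.e. no facet.limit cut-off).

```python
import xml.etree.ElementTree as ET

# The facet fragment from the response shown above.
facet_xml = """
<lst name="manu">
  <int name="Canon USA">17</int>
  <int name="Olympus">12</int>
  <int name="Sony">12</int>
  <int name="Panasonic">9</int>
  <int name="Nikon">4</int>
</lst>
"""


def count_unique_facet_values(xml_text):
    """Count the <int> entries with a non-zero count in one facet list."""
    root = ET.fromstring(xml_text.strip())
    return sum(1 for node in root.findall("int") if int(node.text) > 0)


print(count_unique_facet_values(facet_xml))
```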


On Fri, Jul 16, 2010 at 7:28 PM, kenf_nc <ke...@realestate.com> wrote:

>
> It may just be a mis-wording, but if you do distinct on 'unique' IDs, the
> count should be the same as response.numFound. But if you didn't mean
> 'unique', just count of some field in the results, Rebecca is correct,
> facets should do the job. Something like:
>
> ?q=content:query+text&facet=on&facet.field=rootId
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Finding-distinct-unique-IDs-in-documents-returned-by-fq-Urgent-Help-Req-tp971883p972601.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by kenf_nc <ke...@realestate.com>.

It may just be a mis-wording, but if you do distinct on 'unique' IDs, the
count should be the same as response.numFound. But if you didn't mean
'unique', just count of some field in the results, Rebecca is correct,
facets should do the job. Something like:

?q=content:query+text&facet=on&facet.field=rootId
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Finding-distinct-unique-IDs-in-documents-returned-by-fq-Urgent-Help-Req-tp971883p972601.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

Posted by Rebecca Watson <be...@gmail.com>.

hi,

would faceting work?
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Faceted-Search-Solr

if you have a field for rootId that is multivalued + facet on it -- you'll get
value+count pairs back (top 100 i think by default)

bec :)

On 16 July 2010 16:07, Ninad Raut <hb...@gmail.com> wrote:
> Hi,
>
> I have a scenario in which I have to find the count of distinct unique IDs
> present in a field (rootId field in my case) for a particular query.
>
> I need this for pagination.
>
> Is there a way in Solr to do something like what we do in SQL:
>
> select count(distinct(rootId))
> from table
> where (the query part).
>
>
> Regards,
> Ninad R
>