Posted to solr-user@lucene.apache.org by ses <st...@ssims.co.uk> on 2012/11/28 13:19:18 UTC

Total number of hits within all documents

I'm trying to find a way to retrieve from a Solr query the total number of
hits for a query across all documents.

I'm using an edismax query handler which searches across several fields
(specified in the schema.xml).

I have tried:
/solr/my_core/keyword?q=knights of arabia&fl=ttf:totaltermfreq(html,'knights of arabia')
but the totaltermfreq function only works on individual terms
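
Since totaltermfreq takes one term at a time, one possible workaround is to request a separate pseudo-field per term and add the values up on the client. This is only a minimal sketch: the helper names build_ttf_fl and sum_ttf are hypothetical, the query is assumed to split cleanly on whitespace into already-analyzed terms, and the sample document values are made up. Note that totaltermfreq counts occurrences in the whole index for that field, not only in the documents matching the query.

```python
def build_ttf_fl(field, terms):
    """Return an fl value with one aliased totaltermfreq() call per term."""
    return ",".join(
        "ttf_{0}:totaltermfreq({1},'{0}')".format(term, field) for term in terms
    )


def sum_ttf(doc):
    """Add up the ttf_* pseudo-fields found on a single returned document."""
    return sum(value for key, value in doc.items() if key.startswith("ttf_"))


terms = "knights of arabia".split()
fl = build_ttf_fl("html", terms)

# Made-up document, shaped like one entry of response.docs in a JSON response.
sample_doc = {"id": "1", "ttf_knights": 12, "ttf_of": 340, "ttf_arabia": 7}
total = sum_ttf(sample_doc)
```

Because the value is index-wide, it is the same on every returned document, so reading it from the first document (rows=1) should be enough.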

I have also tried
/solr/my_core/keyword?q=knights of arabia&facet=true&facet.query={!edismax}knights of arabia
which returns only the number of documents that contain the search terms
(the same as numFound).

What I want is the total number of times the search terms appear in all
documents. For a standard disjunctive query like this it would total all
occurrences of 'knights', 'of' and 'arabia'. For a query like q="knights of
arabia", it would count only occurrences of the entire phrase, and for
q=knights AND of AND arabia it would be the total number of times each term
appears across all matching documents (a smaller total than for
q=knights of arabia, since matching documents must contain all three terms).
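
To pin down the three cases described above, here is a toy in-memory model in Python. It is not how Solr or Lucene computes anything; the tokenizer, the function names and the three sample documents are all assumptions made purely to illustrate the counting rules.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_phrase(tokens, phrase):
    """Count positions where the phrase tokens occur consecutively."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def total_hits(docs, terms, mode):
    """Total occurrences over all docs for mode 'or', 'and' or 'phrase'."""
    total = 0
    for text in docs:
        tokens = tokenize(text)
        if mode == "phrase":
            total += count_phrase(tokens, terms)
        elif mode == "or":
            total += sum(tokens.count(t) for t in terms)
        elif mode == "and":
            # Only documents containing every term contribute to the total.
            if all(t in tokens for t in terms):
                total += sum(tokens.count(t) for t in terms)
        else:
            raise ValueError("unknown mode: " + mode)
    return total


docs = [
    "Knights of Arabia: the knights ride at dawn",
    "A history of Arabia",
    "Knights and castles",
]
terms = ["knights", "of", "arabia"]

or_hits = total_hits(docs, terms, "or")          # 7
and_hits = total_hits(docs, terms, "and")        # 4
phrase_hits = total_hits(docs, terms, "phrase")  # 1
```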

I hope this makes sense and that there is some way I might be able to do
this that I am missing? I would also (begrudgingly) be happy if the answer
is that due to the way searching works, this is not possible and Solr/Lucene
will not easily be modified to do this.



--
View this message in context: http://lucene.472066.n3.nabble.com/Total-number-of-hits-within-all-documents-tp4022895.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Total number of hits within all documents

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
If you are migrating from a legacy search engine, you may find this book very
valuable: http://rosenfeldmedia.com/books/searchanalytics/

It will allow you to have a much more productive conversation and,
hopefully, move from arbitrary metrics like number of hits to more
objective ones like relevant results in first X pages.

Regards,
   Alex.
P.s. I am not related to the book, author or the company. I just liked what
they wrote.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Nov 28, 2012 at 9:05 AM, ses <st...@ssims.co.uk> wrote:

> Unfortunately a vague specification is all I have, due to the fact I am
> trying to replicate the functionality in a closed-source legacy search
> product. I suspect no-one at the company knows precisely how this works.
>

Re: Total number of hits within all documents

Posted by Jack Krupansky <ja...@basetechnology.com>.
If users don't understand it anyway... just sum up termfreq(field,term) for
all query terms. Who will know that it is only an approximation?! BUT... it
may cause queries to be significantly slower.

I mean, you COULD add a custom value source such as 
sumtermfreq(field1,term1,field2,term2...) that iterates over all matched 
documents and adds up termfreq(field,term).
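
Short of writing a custom value source, the same approximation could be sketched on the client side: ask Solr for a per-document sum of termfreq() calls as a pseudo-field, then add that field up over every page of results. The helper names, the "hits" alias and the sample pages below are hypothetical; only termfreq() and sum() are existing Solr functions. Paging through every matched document is exactly what makes this slow on large result sets.

```python
def build_hits_fl(fields, terms, alias="hits"):
    """Build an fl pseudo-field summing termfreq() over all field/term pairs."""
    calls = ["termfreq({0},'{1}')".format(f, t) for f in fields for t in terms]
    # A single call needs no sum() wrapper.
    expr = calls[0] if len(calls) == 1 else "sum({0})".format(",".join(calls))
    return "{0}:{1}".format(alias, expr)


def sum_hits(pages, alias="hits"):
    """Add up the pseudo-field over the docs of every fetched result page."""
    return sum(
        doc.get(alias, 0)
        for page in pages
        for doc in page["response"]["docs"]
    )


fl = build_hits_fl(["html"], ["knights", "of", "arabia"])

# Made-up pages, shaped like Solr JSON responses fetched with start/rows.
pages = [
    {"response": {"numFound": 3, "docs": [{"id": "1", "hits": 4}, {"id": "2", "hits": 2}]}},
    {"response": {"numFound": 3, "docs": [{"id": "3", "hits": 1}]}},
]
total = sum_hits(pages)
```

The result is still an approximation: phrase queries, wildcards and analysis differences between the query string and the indexed terms are not accounted for.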

The point with Lucene and Solr is that they reduce total hits into the
compact and more interesting and more "relevant" statistic of a relevance
score (a positive float that is only meaningful relative to the other scores
for the same query, not a fixed 0.0 to 1.0 range). Maybe your users would
simply like to see that more "modern" statistic than the useless total hits
which has so little real value anyway. And the score is essentially free.
Just add to "fl": &fl=id,field1,...,score.

-- Jack Krupansky



Re: Total number of hits within all documents

Posted by ses <st...@ssims.co.uk>.
Unfortunately a vague specification is all I have, due to the fact I am
trying to replicate the functionality in a closed-source legacy search
product. I suspect no-one at the company knows precisely how this works.

The purpose is ultimately to display to the user the entire number of 'hits'
found in all documents where a hit is any place in the text of the fields
searched (defined as 'qf' in the edismax search handler) where the search
terms appear. Essentially it should be like counting the number of
highlighted hits in a search with highlighting turned on. I could easily do
this for just the number of documents returned, specified by the 'rows'
parameter, by turning highlighting on and counting the snippets returned.
But I want this value for the entire dataset, which I have a feeling will be
too slow if I specify rows = total numFound.
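
As a rough sketch of the highlighting approach for a single page of results, the marked-up terms in the "highlighting" section of a JSON response can be counted like this. The function name and the sample response are hypothetical, and the default <em> marker is assumed. The count is only as complete as the snippets Solr returns, so hl.snippets and hl.fragsize would have to be raised well above their defaults for it to approach a true total.

```python
def count_highlight_hits(response, pre="<em>"):
    """Count highlight markers across all docs, fields and snippets."""
    highlighting = response.get("highlighting", {})
    return sum(
        snippet.count(pre)
        for fields in highlighting.values()
        for snippets in fields.values()
        for snippet in snippets
    )


# Made-up response fragment: doc id -> field name -> list of snippets.
sample_response = {
    "highlighting": {
        "doc1": {
            "html": [
                "The <em>knights</em> <em>of</em> <em>Arabia</em> rode",
                "brave <em>knights</em>",
            ]
        },
        "doc2": {"title": ["History <em>of</em> <em>Arabia</em>"], "html": []},
    }
}
page_hits = count_highlight_hits(sample_response)
```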

I just want it to count this number for the fields specified in 'qf'. If it
could count all occurrences of terms that match wildcard queries, that would
be good but not essential. Fuzzy/span queries aren't used.

I would be fine with an approximation; for all I know, this is how the old
search product works.

I hope this clarifies things a little. I realize it is a strange requirement
that the user is unlikely to even understand, but nevertheless apparently
the user must see something along the lines of 'X documents found, Y hits
found'.




Re: Total number of hits within all documents

Posted by Jack Krupansky <ja...@basetechnology.com>.
Clue us in as to what you actually want to do with this number. Maybe an 
approximation might solve the problem as well? In other words, what degree 
of accuracy is actually required?

Also, make sure you actually can reduce your proposed calculation to a
mathematical function. As stated, it is a little too vague and non-specific.
For example, what about queries that combine multiple fields and OR
operations, or wildcard or fuzzy queries? What about span queries? How would
you count hits when OR is used? And how would some downstream process
actually use it?

A custom scorer and/or one or more custom function query value sources might 
be able to do exactly what you want, provided that you can express it with 
mathematical crispness - in other words, specific, crisp rules.

Without a crisp specification it is difficult to say whether Lucene/Solr can
or cannot give you your desired magic number out of the box, although with
custom scoring and custom value sources you should be able to do just about
anything (that can be mathematically formulated).

-- Jack Krupansky
