Posted to solr-user@lucene.apache.org by "tech.vronk" <te...@vronk.net> on 2012/11/20 20:55:00 UTC

relative token count in a query result

Hello,

Earlier, I was trying to retrieve the total token count per index:
http://lucene.472066.n3.nabble.com/how-to-retrieve-total-token-count-per-collection-index-td4000161.html

Now I would like to get a token (word) count within the document set
resulting from a query,
both for the matching word and as the sum of all tokens in the matching documents.

The ultimate goal is to compute relative frequencies of
terms on a per-token basis instead of a per-article basis.

So if I search for the word "Haus" within a subcollection (defined by a
separate query) and the word appears 2 times in matching doc A and 5 times
in matching doc B, I need a hit count of 7, not 2.

In addition, if the subcollection contains the documents
A with 300 tokens (i.e. running words, not distinct terms)
B with 100 tokens
C with 50 tokens

I also need this second sum, i.e. 450.
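
As a worked example of the target figure, here is a minimal Python sketch that computes the token-based relative frequency from the numbers above. The assumption that doc C contains no occurrence of "Haus" is mine, as are the names `term_counts`, `token_counts`, and `relative_frequency`.

```python
# Per-document occurrences of "Haus" (doc C is assumed to contain none).
term_counts = {"A": 2, "B": 5, "C": 0}
# Per-document running-word counts from the example above.
token_counts = {"A": 300, "B": 100, "C": 50}


def relative_frequency(term_counts, token_counts):
    """Return occurrences of the term divided by all tokens in the document set."""
    total_tokens = sum(token_counts.values())
    if total_tokens == 0:
        return 0.0  # avoid dividing by zero for an empty subcollection
    return sum(term_counts.values()) / total_tokens


hits = sum(term_counts.values())     # token-based hit count
tokens = sum(token_counts.values())  # all running words in the subcollection
rel = relative_frequency(term_counts, token_counts)
```

With these inputs the hit count is 7 and the token sum is 450, so the relative frequency is 7/450.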

I plan to get the second number by first
preprocessing each document to count its tokens,
storing the count in a separate field,
and then applying the StatsComponent,
which will give me the sum for the given query/subcollection.
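
The plan above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the field name `token_count` and the helper names are hypothetical, and plain whitespace splitting will not necessarily match what the `text_de` analyzer counts as a token. The `stats` and `stats.field` parameters are the usual StatsComponent request parameters.

```python
from urllib.parse import urlencode


def count_tokens(text):
    """Count running words using plain whitespace splitting (a simplification)."""
    return len(text.split())


def add_token_count(doc, text_field="inhalt", count_field="token_count"):
    """Return a copy of the document with its token count in a separate field."""
    enriched = dict(doc)
    enriched[count_field] = count_tokens(doc.get(text_field, ""))
    return enriched


def stats_query(filter_query, count_field="token_count"):
    """Build a select URL asking the StatsComponent for stats on the count field."""
    params = [
        ("q", "*:*"),
        ("fq", filter_query),        # the query that defines the subcollection
        ("rows", "0"),               # only the stats are needed, not the documents
        ("stats", "true"),
        ("stats.field", count_field),
    ]
    return "/solr/select/?" + urlencode(params)


doc = add_token_count({"id": "A", "inhalt": "Das Haus ist ein Haus"})
url = stats_query("docsrc:falter")
```

The `sum` entry in the stats section of the response for `token_count` would then be the total token count for the subcollection.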

For the first number, I could use the termfreq() function,
but that only gives me the term frequency per document.
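
For reference, summing the per-document values on the client side might look like the Python sketch below. It assumes the per-document frequency was requested as an aliased pseudo-field, e.g. `fl=id,tf:termfreq(inhalt,'haus')` (which requires a Solr version that supports function pseudo-fields), that the JSON response has already been parsed into a dict, and that all matching rows were fetched. The alias `tf` and the function name are hypothetical.

```python
def sum_term_freq(response, alias="tf"):
    """Sum the per-document term frequency over all returned documents."""
    docs = response.get("response", {}).get("docs", [])
    # Documents missing the alias contribute 0 to the sum.
    return sum(int(doc.get(alias, 0)) for doc in docs)


# Hand-written sample in the shape of a parsed Solr JSON response.
sample = {
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "A", "tf": 2},
            {"id": "B", "tf": 5},
        ],
    }
}
total = sum_term_freq(sample)
```

For large result sets this means paging through every hit, which is the iteration the question tries to avoid.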

So, before I iterate over the whole result to sum it up,
I wonder whether the StatsComponent could also compute the sum
over a dynamically computed value (the result of the function) rather than a stored field.
I tried this:
/solr/select/?fq=docsrc:falter&q={!func}tf(inhalt,'haus')&stats=true&stats.field=score&rows=10&indent=true&fl=score&debugQuery=true

but got the error:
<str name="msg">Field type text_de{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100}} is not currently supported</str>

Or is there any other way?

If I understand correctly, none of tf(), idf(), or sttf() would be of
any help here either.

Thanks in advance

best,
matej

Re: relative token count in a query result

Posted by Mikhail Khludnev <mk...@griddynamics.com>.

Hello,

Have you tried implementing your own Collector and passing it to
IndexSearcher.search()? A Collector has a reference to the current scorer and
can therefore presumably access the tf info from the TermQuery scorer:
org.apache.lucene.search.TermScorer.freq(). The collector can then simply sum
these tfs.

Be aware of a small problem when doing the same with a few disjunction clauses.
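
The collector idea above can be illustrated with a small self-contained Python model. This is not Lucene code and does not use the Lucene API: `TermFreqSumCollector`, `ToyTermScorer`, and `search` are hypothetical stand-ins that only mimic the pattern of a collector asking the current scorer for the term frequency of each hit and summing the results.

```python
class ToyTermScorer:
    """Stand-in for a term scorer: knows the tf of one term per document."""

    def __init__(self, postings):
        self.postings = postings  # doc_id -> term frequency

    def freq(self, doc_id):
        return self.postings.get(doc_id, 0)


class TermFreqSumCollector:
    """Stand-in for a collector: sums the tf reported by the scorer for each hit."""

    def __init__(self):
        self.total = 0
        self.scorer = None

    def set_scorer(self, scorer):
        self.scorer = scorer

    def collect(self, doc_id):
        self.total += self.scorer.freq(doc_id)


def search(postings, allowed_docs, collector):
    """Feed every document that matches the term and the filter to the collector."""
    collector.set_scorer(ToyTermScorer(postings))
    for doc_id in sorted(postings):
        if doc_id in allowed_docs:  # restrict hits to the subcollection
            collector.collect(doc_id)
    return collector.total


haus_postings = {"A": 2, "B": 5, "X": 9}  # X lies outside the subcollection
subcollection = {"A", "B", "C"}
total = search(haus_postings, subcollection, TermFreqSumCollector())
```

Here the sum is 7, since document X is filtered out and document C does not contain the term.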



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 <mk...@griddynamics.com>