Posted to general@lucene.apache.org by zehua <br...@yahoo.com> on 2009/06/16 22:30:35 UTC

Question for top term frequency

We have an indexed field called "Author" that contains the author name.
We'd like to get the records for the top 10 authors, that is, the authors
with the most records in the Lucene index. Is there a good way to do it?
I searched the mailing list and did not find a good match.
-- 
View this message in context: http://www.nabble.com/Question-for-top-term-frequency-tp24062253p24062253.html
Sent from the Lucene - General mailing list archive at Nabble.com.


Re: Question for top term frequency

Posted by zehua <br...@yahoo.com>.
Thanks for the reply.

The problem is that the total number of documents may be huge, for example
10,000. If we return all these documents and find the top authors by
looping over the term frequencies, it can take a long time.

We are considering using CustomScoreQuery. The first parameter is the
normal query that matches the results. The second parameter uses the
frequency of the "Author" field value to increase the score, so that results
from the top authors score higher and are returned first. Does that make
sense?
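
To make the boosting idea concrete, here is a minimal sketch in plain Python
rather than against the Lucene API. It assumes each hit is a (score, author)
pair and that "frequency" means the author's count among the hits; the
function name and the logarithmic boost are illustrative choices, not
anything CustomScoreQuery prescribes.

```python
from collections import Counter
import math


def rerank_by_author_frequency(hits):
    """Re-rank (score, author) pairs so prolific authors score higher.

    Hypothetical helper: the boost 1 + ln(count) is an assumption made
    for illustration only.
    """
    # Count how many hits each author has within this result set.
    counts = Counter(author for _, author in hits)
    # Multiply the base score by a factor that grows with the author's count.
    boosted = [
        (score * (1.0 + math.log(counts[author])), author)
        for score, author in hits
    ]
    # Highest boosted score first.
    return sorted(boosted, key=lambda pair: pair[0], reverse=True)


hits = [(1.0, "a"), (0.9, "b"), (0.8, "b")]
reranked = rerank_by_author_frequency(hits)
```

Note that in this sketch boosting only reorders the hits: the single hit from
author "a" is still returned, just ranked lower. Restricting the results to
the top authors would need a separate filtering step.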



Ted Dunning wrote:
> 
> It is easy to get global document frequencies for all authors.
> 
> Then it is easy to build a query that accepts documents from any of the
> top
> authors.
> 
> It requires more than one query, but only a few lines of code.
> 
> On Tue, Jun 16, 2009 at 1:30 PM, zehua <br...@yahoo.com> wrote:
> 
>> Is there a
>> good way to do it? I searched the mailing list, and did not find a good
>> match.
>>
> 
> 



Re: Question for top term frequency

Posted by Ted Dunning <te...@gmail.com>.
It is easy to get global document frequencies for all authors.

Then it is easy to build a query that accepts documents from any of the top
authors.

It requires more than one query, but only a few lines of code.
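
The two steps above can be sketched in plain Python, with an in-memory list
standing in for the index. The document layout and the helper names
(top_authors, docs_by_authors) are assumptions for illustration; in Lucene
the counts would come from the document frequencies of the terms in the
Author field instead of from a loop over documents.

```python
from collections import Counter


def top_authors(docs, n=10):
    """Step 1: return the n authors with the most documents, largest first."""
    counts = Counter(doc["Author"] for doc in docs)
    return [author for author, _ in counts.most_common(n)]


def docs_by_authors(docs, authors):
    """Step 2: keep only documents written by one of the given authors."""
    wanted = set(authors)
    return [doc for doc in docs if doc["Author"] in wanted]


# Toy corpus (hypothetical data).
corpus = [
    {"id": 1, "Author": "smith"},
    {"id": 2, "Author": "jones"},
    {"id": 3, "Author": "smith"},
    {"id": 4, "Author": "lee"},
    {"id": 5, "Author": "jones"},
    {"id": 6, "Author": "smith"},
]

top = top_authors(corpus, n=2)
selected = docs_by_authors(corpus, top)
```

Here the second step plays the role of a query that ORs together one clause
per top author.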

On Tue, Jun 16, 2009 at 1:30 PM, zehua <br...@yahoo.com> wrote:

> Is there a
> good way to do it? I searched the mailing list, and did not find a good
> match.
>

Re: Question for top term frequency

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 17, 2009, at 7:28 PM, Ted Dunning wrote:

> It is indeed faceting.  I misunderstood the original request as being
> against the entire corpus.  For the very modest size result that he is
> talking about, SOLR faceting should work just fine.

Even the entire corpus is fine, just use *:* (MatchAllDocsQuery) in
Solr ;-)


Re: Question for top term frequency

Posted by Ted Dunning <te...@gmail.com>.
It is indeed faceting.  I misunderstood the original request as being
against the entire corpus.  For the very modest size result that he is
talking about, SOLR faceting should work just fine.

Zehua's loss of the word NOT in his latest message increased my confusion a
bit.

On Wed, Jun 17, 2009 at 3:42 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Isn't this just faceting on the author field and then making a query out of
> the top ten authors?  I think you could do this in Solr pretty easily.  Or
> maybe I don't understand the question.
>
> -Grant
>
> On Jun 17, 2009, at 5:45 PM, zehua wrote:
>
>
>> One thing to add is that the top author is *[NOT]* based on all
>> documents. It is
>> based on the returned results.
>> For example, we have 10000 results match the query, the top authors are
>> among the 10000 results.
>>
>>
>>

Re: Question for top term frequency

Posted by Grant Ingersoll <gs...@apache.org>.
Isn't this just faceting on the author field and then making a query  
out of the top ten authors?  I think you could do this in Solr pretty  
easily.  Or maybe I don't understand the question.

-Grant
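
As a rough sketch of the faceting approach, the Python below builds the two
Solr request URLs: one that facets on the author field for the user's query,
and one that re-runs the query restricted to the top authors. The endpoint
URL and the helper names are assumptions; facet, facet.field, facet.limit,
facet.mincount, rows, q and fq are standard Solr request parameters. Facet
counts are computed over the documents matching q, so passing *:* as the
query facets over the whole index.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port and path for a real installation.
SOLR_SELECT = "http://localhost:8983/solr/select"


def facet_request(user_query, field="Author", limit=10):
    """URL asking Solr for the `limit` most frequent values of `field`."""
    params = [
        ("q", user_query),
        ("rows", 0),              # only the facet counts are needed here
        ("facet", "true"),
        ("facet.field", field),
        ("facet.limit", limit),
        ("facet.mincount", 1),    # skip authors with no matching documents
    ]
    return SOLR_SELECT + "?" + urlencode(params)


def author_filter(authors, field="Author"):
    """Filter query matching any of the given authors, e.g. Author:("a" OR "b")."""
    def quote(value):
        # Escape backslashes first, then double quotes, for a quoted phrase.
        return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')
    return "%s:(%s)" % (field, " OR ".join(quote(a) for a in authors))


def filtered_request(user_query, authors, rows=10):
    """URL re-running the query restricted to the given authors."""
    params = [("q", user_query), ("fq", author_filter(authors)), ("rows", rows)]
    return SOLR_SELECT + "?" + urlencode(params)
```

A client would send the first request, read the author names from the facet
counts in the response, and pass them to filtered_request for the second
request.
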
On Jun 17, 2009, at 5:45 PM, zehua wrote:

>
> One thing to add is that the top author is based on all documents.
> It is
> based on the returned results.
> For example, we have 10000 results match the query, the top authors  
> are
> among the 10000 results.
>
>
> zehua wrote:
>>
>> We have one column called "Author" indexed which contains the  
>> author name.
>> We'd like to
>> get the records with the top 10 authors who have most records in the
>> lucene. Is there a
>> good way to do it? I searched the mailing list, and did not find a  
>> good
>> match.
>>
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/



Re: Question for top term frequency

Posted by zehua <br...@yahoo.com>.
One thing to add is that the top author is [not] based on all documents. It
is based on the returned results.
For example, if 10000 results match the query, the top authors are
chosen from among those 10000 results.
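
A small sketch of the distinction, in plain Python with made-up data: the
search function is a stand-in for a real query, and the names are
hypothetical. It shows that the top author among the hits can differ from
the top author over the whole collection.

```python
from collections import Counter


def search(docs, term):
    """Stand-in for a real query: keep docs whose title contains the term."""
    return [d for d in docs if term in d["title"].lower().split()]


def top_authors_in_hits(hits, n=10):
    """Return (author, count) pairs for the n most frequent authors in hits."""
    return Counter(h["Author"] for h in hits).most_common(n)


docs = [
    {"title": "Lucene in Action", "Author": "smith"},
    {"title": "Scaling Lucene", "Author": "jones"},
    {"title": "Lucene internals", "Author": "smith"},
    {"title": "Cooking basics", "Author": "lee"},
    {"title": "More cooking", "Author": "lee"},
    {"title": "Cooking again", "Author": "lee"},
]

hits = search(docs, "lucene")
top_in_hits = top_authors_in_hits(hits, n=1)   # [("smith", 2)]
top_overall = top_authors_in_hits(docs, n=1)   # [("lee", 3)]
```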


zehua wrote:
> 
> We have one column called "Author" indexed which contains the author name.
> We'd like to
> get the records with the top 10 authors who have most records in the
> lucene. Is there a 
> good way to do it? I searched the mailing list, and did not find a good
> match.
> 
