Posted to java-user@lucene.apache.org by Chris Brown <ch...@orangepics.com> on 2006/01/09 17:28:28 UTC

top n words within a results set?

Hello,

Is it possible to retrieve the top 'n' most frequently occurring words within the results of a search? I've seen the High Frequency Terms code in the sandbox, but it works across the whole index.

To put this question into context: we're developing a website that hosts users' photo websites. Searches can be specific to a particular user's website or be performed globally across one, many or all websites. I've accomplished this with a field in the index called website. What I'd like to do is give each user the top ten words that appear on their website.

Thanks,
Chris Brown

http://www.orangepics.com/

Re: RF and IDF

Posted by Yonik Seeley <ys...@gmail.com>.

Click on "Source Repository" off of the main Lucene page.

Here is a pointer to the search package containing TermQuery/Weight/Scorer
http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/?sortby=file#dirlist

Look in TermQuery for TermWeight (it's an inner class).

-Yonik

On 1/11/06, Klaus <kl...@vommond.de> wrote:
> Thx, but where can I find these classes?
>
> >If you really want to understand how scoring works, I'd suggest also
> >looking at TermWeight/TermScorer.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

AW: RF and IDF

Posted by Klaus <kl...@vommond.de>.

Thx, but where can I find these classes?

>If you really want to understand how scoring works, I'd suggest also
>looking at TermWeight/TermScorer.



Re: RF and IDF

Posted by Yonik Seeley <ys...@gmail.com>.

On 1/11/06, Klaus <kl...@vommond.de> wrote:
> Hi all,
>
> do you know how the tf and idf values are computed by the default
> similarity? I mean the exact mathematical equation.

Well, here is the default Similarity:

/** Expert: Default scoring implementation. */
public class DefaultSimilarity extends Similarity {
  /** Implemented as <code>1/sqrt(numTerms)</code>. */
  public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
  }

  /** Implemented as <code>1/sqrt(sumOfSquaredWeights)</code>. */
  public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }

  /** Implemented as <code>sqrt(freq)</code>. */
  public float tf(float freq) {
    return (float)Math.sqrt(freq);
  }

  /** Implemented as <code>1 / (distance + 1)</code>. */
  public float sloppyFreq(int distance) {
    return 1.0f / (distance + 1);
  }

  /** Implemented as <code>log(numDocs/(docFreq+1)) + 1</code>. */
  public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
  }

  /** Implemented as <code>overlap / maxOverlap</code>. */
  public float coord(int overlap, int maxOverlap) {
    return overlap / (float)maxOverlap;
  }
}


If you really want to understand how scoring works, I'd suggest also
looking at TermWeight/TermScorer.
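
As a rough illustration of the formulas above, here is a standalone sketch that restates tf, idf and lengthNorm in plain Java so the numbers can be tried without Lucene on the classpath. The class name SimilarityFormulas and the sample counts are made up for the example; only the three formulas come from the DefaultSimilarity listing. Note that Math.log is the natural logarithm.

```java
/** Standalone restatement of three DefaultSimilarity formulas; not Lucene code. */
final class SimilarityFormulas {
    /** tf = sqrt(freq), where freq is the term's count within one document field. */
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    /** idf = ln(numDocs / (docFreq + 1)) + 1, where docFreq is the number of docs containing the term. */
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    /** lengthNorm = 1 / sqrt(numTerms), where numTerms is the number of terms in the field. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Hypothetical counts: the term occurs 4 times in a 100-term field
        // and appears in 9 of 1000 documents.
        System.out.println("tf         = " + tf(4));
        System.out.println("idf        = " + idf(9, 1000));
        System.out.println("lengthNorm = " + lengthNorm(100));
    }
}
```

How these factors are multiplied together into a final score is decided in TermWeight/TermScorer, which this sketch does not attempt to reproduce.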

-Yonik


Re: top n words within a results set?

Posted by Chris Brown <ch...@orangepics.com>.

Excellent!! Thank you so much!

----- Original Message ----- 
From: "Grant Ingersoll" <gs...@syr.edu>
To: <ja...@lucene.apache.org>
Sent: Wednesday, January 11, 2006 12:07 PM
Subject: Re: top n words within a results set?


> Hey Chris,
>
> There is just such an analyzer, called the PerFieldAnalyzerWrapper.  The 
> trick is the Analyzer always passes in the Field name when it gets the 
> TokenStream,
>
> -Grant

Re: top n words within a results set?

Posted by Grant Ingersoll <gs...@syr.edu>.

Hey Chris,

There is just such an analyzer, called PerFieldAnalyzerWrapper.  The 
trick is that the field name is always passed to the Analyzer when a 
TokenStream is requested, so the wrapper can delegate to a different 
analyzer for each field.
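
To make the dispatch idea concrete, here is a toy model in plain Java, not the Lucene classes themselves. Everything in it is an assumption for illustration: the class PerFieldSketch, the field names "keywords" and "keywords_display", and especially toyStem, which is only a crude stand-in for a real stemmer. In Lucene itself the wrapper is, as far as I know, built from a default analyzer plus addAnalyzer(fieldName, analyzer) calls; check the javadoc for your version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Toy model of per-field analysis: the field name selects the analyzer. */
final class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String fieldName, Function<String, List<String>> analyzer) {
        perField.put(fieldName, analyzer);
    }

    /** Looks up the analyzer registered for the field, falling back to the default. */
    List<String> tokens(String fieldName, String text) {
        return perField.getOrDefault(fieldName, defaultAnalyzer).apply(text);
    }

    /** Lowercases and splits on non-letters; stands in for an unstemmed analyzer. */
    static List<String> plain(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Crude stand-in for a stemmer: rewrites a trailing "y" to "i", nothing more. */
    static List<String> toyStem(String text) {
        List<String> out = new ArrayList<>();
        for (String t : plain(text)) {
            out.add(t.endsWith("y") ? t.substring(0, t.length() - 1) + "i" : t);
        }
        return out;
    }

    public static void main(String[] args) {
        PerFieldSketch analyzer = new PerFieldSketch(PerFieldSketch::toyStem);
        analyzer.addAnalyzer("keywords_display", PerFieldSketch::plain);
        System.out.println(analyzer.tokens("keywords", "Party photos"));         // stemmed field
        System.out.println(analyzer.tokens("keywords_display", "Party photos")); // display field
    }
}
```

With a setup along these lines, one index can hold a stemmed field for searching and an unstemmed copy whose terms are fit to show to users.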

-Grant

Chris Brown wrote:

> Bear with me, I might be missing something.... My documents get 
> indexed ( writer.addDocument(doc) ) with one IndexWriter created using 
> one Analyzer (the SnowballAnalyzer). So unless you can somehow use a 
> different Analyzer per field I don't see how the second field will 
> help. If I get the TermFreqVector for a field for a document that was 
> indexed using the SnowballAnalyzer, isn't it always going to return 
> stemmed words?
>
> To confirm your assumption, I suppose I am trying to display the 
> values of the indexed field. It doesn't matter to me whether I count 
> "party" and "parties" as separate words or not but I cannot display 
> "parti" to a user as it's not a word.
>
> I'm thinking I need a separate index with the field created using the 
> StandardAnalyzer unless there's some other trick with mixing Analyzers 
> I'm unaware of.
>
> Thanks again for your help,
> Chris

-- 
------------------------------------------------------------------- 
Grant Ingersoll 
Sr. Software Engineer 
Center for Natural Language Processing 
Syracuse University 
School of Information Studies 
337 Hinds Hall 
Syracuse, NY 13244 

http://www.cnlp.org 
Voice:  315-443-5484 
Fax: 315-443-6886 



Re: top n words within a results set?

Posted by Chris Brown <ch...@orangepics.com>.

Bear with me, I might be missing something.... My documents get indexed ( 
writer.addDocument(doc) ) with one IndexWriter created using one Analyzer 
(the SnowballAnalyzer). So unless you can somehow use a different Analyzer 
per field I don't see how the second field will help. If I get the 
TermFreqVector for a field for a document that was indexed using the 
SnowballAnalyzer, isn't it always going to return stemmed words?

To confirm your assumption, I suppose I am trying to display the values of 
the indexed field. It doesn't matter to me whether I count "party" and 
"parties" as separate words or not but I cannot display "parti" to a user as 
it's not a word.

I'm thinking I need a separate index with the field created using the 
StandardAnalyzer unless there's some other trick with mixing Analyzers I'm 
unaware of.

Thanks again for your help,
Chris

----- Original Message ----- 
From: "Grant Ingersoll" <gs...@syr.edu>
To: <ja...@lucene.apache.org>
Sent: Wednesday, January 11, 2006 8:32 AM
Subject: Re: top n words within a results set?


>I believe the usual solution is to have a separate field on the same 
>document for display purposes (I am assuming you are trying to display the 
>values of the indexed field) that is not stemmed.   The tradeoff is in disk 
>space, of course.

Re: Boolean Query

Posted by Chris Hostetter <ho...@fucit.org>.

: BooleanQuery query = new BooleanQuery();
: for(Term t: terms)
: {
: 	query.add(new TermQuery(t), false, false); // is this wrong?
: }
:
: If I construct the query as a string like "A a OR B b OR C" I get many more
: results. I assume that the Boolean query uses an AND operator. How can I
: change that?

The "false, false" you pass when adding the subclauses should give you the
"OR" behavior, but more than likely the problem you are running into has to
do with the analyzer being used by your QueryParser when it parses your
string -- when you build the query up by hand, no analyzer is used, so if
the analyzer used at indexing time did any lowercasing or stemming you'll
miss a lot of matches.

A quick thing you should try is comparing the toString output of each of
the queries you are comparing (the one QueryParser built, and the one you
built by hand).  You should also look at this wiki entry, and pick up a
copy of Lucene in Action and read chapter 4.
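
To illustrate the analyzer point with a toy model (plain Java, no Lucene; the class OrQuerySketch, its methods and the sample documents are all invented for this sketch): the "index" below lowercases text when documents are added, and an OR lookup that uses the raw terms finds nothing, while the same lookup with terms normalized the way the index was built finds both documents.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy inverted index showing why hand-built terms must match index-time analysis. */
final class OrQuerySketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Index-time "analysis": lowercase, then split on whitespace. */
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** OR semantics: a document matches if it contains any term, looked up exactly as given. */
    Set<Integer> matchAny(String... terms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : terms) {
            hits.addAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        OrQuerySketch index = new OrQuerySketch();
        index.addDocument(1, "Apple pie");
        index.addDocument(2, "Banana bread");
        System.out.println(index.matchAny("Apple", "Banana")); // raw terms: prints []
        System.out.println(index.matchAny("apple", "banana")); // normalized terms: prints [1, 2]
    }
}
```

The same reasoning applies to stemming: a hand-built term has to be the stemmed form if the indexed field was stemmed.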

: And I'm wondering what happens if I boost a TermQuery with a value smaller
: than one. I'm asking because I would like to boost each TermQuery with the
: tf*idf value of the term in the original document. From my point of view,
: this should lead to better precision, but at first glance the results
: are worse.

Before you try this, make sure you understand the existing score
calculation ... look at the explain info for each document against your
query and see what it's already doing.


-Hoss



Boolean Query

Posted by Klaus <kl...@vommond.de>.

Hi,

I have got another question... How do I construct a BooleanQuery where the
terms within the query are connected with OR?

I have a list of terms, representing the highest-scored terms in a document.
Here is my code:

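import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

BooleanQuery query = new BooleanQuery();
for(Term t: terms)
{
	// Wrap each Term in a TermQuery and add it as an optional clause
	// (required = false, prohibited = false).
	query.add(new TermQuery(t), false, false); // is this wrong?
}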

If I construct the query as a string like "A a OR B b OR C" I get many more
results. I assume that the Boolean query uses an AND operator. How can I
change that?

And I'm wondering what happens if I boost a TermQuery with a value smaller
than one. I'm asking because I would like to boost each TermQuery with the
tf*idf value of the term in the original document. From my point of view,
this should lead to better precision, but at first glance the results
are worse.

THX,

Klaus




RF and IDF

Posted by Klaus <kl...@vommond.de>.

Hi all,

do you know how the tf and idf values are computed by the default
similarity? I mean the exact mathematical equation.

Thx,

Klaus





Re: top n words within a results set?

Posted by Grant Ingersoll <gs...@syr.edu>.

I believe the usual solution is to have a separate field on the same 
document for display purposes (I am assuming you are trying to display 
the values of the indexed field) that is not stemmed.   The tradeoff is 
in disk space, of course.

Chris Brown wrote:

> Okay, I've taken Grant's advice and aggregated the TermFreqVectors for
> each term in the applicable field. It works quite well; there's just one
> glitch.
>
> Some words like "party" and "picture" appear as "parti" and "pictur". I am
> using the SnowballAnalyzer, and I suspect that's what's changing the words.
> Short of maintaining a second index using a different analyzer, does anyone
> have any ideas?

Re: top n words within a results set?

Posted by Chris Brown <ch...@redsky.ca>.

Okay, I've taken Grant's advice and aggregated the TermFreqVectors for
each term in the applicable field. It works quite well; there's just one
glitch.

Some words like "party" and "picture" appear as "parti" and "pictur". I am
using the SnowballAnalyzer, and I suspect that's what's changing the words.
Short of maintaining a second index using a different analyzer, does anyone
have any ideas?

----- Original Message ----- 
From: "Grant Ingersoll" <gs...@syr.edu>
To: <ja...@lucene.apache.org>
Sent: Monday, January 09, 2006 12:34 PM
Subject: Re: top n words within a results set?


> You could use term vectors to accomplish this.  Get your hits for the 
> website, then load the term vector for the field containing the keywords 
> and add up the frequencies

Re: top n words within a results set?

Posted by Chris Brown <ch...@orangepics.com>.

Okay great! Thanks for the quick response and pointing me in the right 
direction. I'll go get out my Lucene in Action book ;) and learn all about 
term vectors.

----- Original Message ----- 
From: "Grant Ingersoll" <gs...@syr.edu>
To: <ja...@lucene.apache.org>
Sent: Monday, January 09, 2006 12:34 PM
Subject: Re: top n words within a results set?


> You could use term vectors to accomplish this.  Get your hits for the 
> website, then load the term vector for the field containing the keywords 
> and add up the frequencies

Re: top n words within a results set?

Posted by Grant Ingersoll <gs...@syr.edu>.

You could use term vectors to accomplish this.  Get your hits for the 
website, then load the term vector for the field containing the keywords 
from each hit and add up the frequencies per term.
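
A minimal sketch of that aggregation step in plain Java, with no Lucene dependency. It assumes you have already pulled, for each hit, the parallel term and frequency arrays out of the field's term vector (in Lucene of this vintage that is, as far as I know, what TermFreqVector's getTerms() and getTermFrequencies() return). The class name TopTerms, the topN method and the sample data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sums per-document term frequencies and picks the most frequent terms. */
final class TopTerms {
    /**
     * terms.get(i) and freqs.get(i) are the parallel arrays for the i-th hit.
     * Returns at most n (term, total frequency) entries, highest total first.
     */
    static List<Map.Entry<String, Integer>> topN(List<String[]> terms, List<int[]> freqs, int n) {
        Map<String, Integer> totals = new HashMap<>();
        for (int doc = 0; doc < terms.size(); doc++) {
            String[] t = terms.get(doc);
            int[] f = freqs.get(doc);
            for (int i = 0; i < t.length; i++) {
                totals.merge(t[i], f[i], Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(totals.entrySet());
        // Highest total first; ties are broken alphabetically so the order is deterministic.
        sorted.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    public static void main(String[] args) {
        // Hypothetical term vectors for two hits.
        List<String[]> terms = new ArrayList<>();
        List<int[]> freqs = new ArrayList<>();
        terms.add(new String[] {"beach", "party"});
        freqs.add(new int[] {1, 3});
        terms.add(new String[] {"party", "picture"});
        freqs.add(new int[] {2, 1});
        for (Map.Entry<String, Integer> e : topN(terms, freqs, 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Sorting the whole map is fine for one website's worth of terms; for very large vocabularies a bounded priority queue of size n would avoid the full sort.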

Chris Brown wrote:

>Hello,
>
>Is it possible to retrieve the top 'n' most often appearing words within a search criteria? I've seen the High Frequency Terms code in the sandbox but it works across the whole index.
>
>To put this question into context: we're developing a website that hosts users' photo websites. Searches can be specific to a particular user's website or be performed globally across one, many or all websites. I've accomplished this with a field in the index called website. What I'd like to do is give each user the top ten words that appear on their website.
>
>Thanks,
>Chris Brown
>
>http://www.orangepics.com/
>