Posted to user@nutch.apache.org by Chris K Wensel <ch...@thomson.com> on 2006/09/26 00:02:10 UTC
term frequency
Hi all
I'm interested in playing with term frequency values in a Nutch index, at
both per-document and index-wide scope.
For example, something similar to this Lucene FAQ entry:
http://tinyurl.com/ra3ys
So what is the 'correct' way to inspect the Nutch index for these values,
particularly against the Lucene IndexReader behind the Nutch IndexSearcher?
Since I don't see anything on the Searcher interface, is there some other
Hadoop-ified way to do this?
Assuming there isn't, if I were to add the ability to get per-document and
index-wide term frequencies, would this be exposed on the
nutch.searcher.Searcher interface?
e.g.
Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
Searcher.getTermVector( Hit hit, String field )
Searcher.getTermVector( String field )
Or is there a more relevant interface this should hang off of? Searcher
doesn't seem like a fit, and neither does HitDetailer. Maybe HitTermVector
and IndexTermVector?
Or is this just insane: it won't work the way I think, and I should forget
about getting corpus-relevant info from the indexes at runtime?
cheers,
ckw
Re: term frequency
Posted by Enis Soztutar <en...@gmail.com>.
Hi,
For some statistical analysis, I also needed term frequencies across the
whole collection. Since Lucene only gives the term frequency per document,
I calculated the collection-wide frequency by summing the term's frequency
over every document that contains it. The code fragment below does this:
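/**
 * Returns the total number of occurrences of the given term across
 * all documents in the index.
 * @param term the term to count
 * @return the number of occurrences of the term
 * @throws IOException if the index cannot be read
 */
private int getCount(Term term) throws IOException {
  int count = 0;
  // 'reader' is the Lucene IndexReader opened on the index
  TermDocs termDocs = reader.termDocs(term);
  try {
    // visit each document containing the term and add its frequency
    while (termDocs.next()) {
      count += termDocs.freq();
    }
  } finally {
    termDocs.close();
  }
  return count;
}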
However, this method is inefficient, since it recalculates the value
every time it is called, so a caching mechanism would be useful.
Alternatively, you could build a HashMap up front and store the
<term, frequency> pairs in it.
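
As a rough illustration of the caching idea, here is a minimal sketch that memoizes counts in a HashMap. The class name TermCountCache, its methods, and the String key are all hypothetical and not part of Nutch or Lucene. The counter function stands in for a scan such as getCount(Term), so the sketch runs without an index.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/**
 * Hypothetical helper: remembers the total count for each term so the
 * expensive index scan runs at most once per distinct term.
 */
class TermCountCache {
    // term text -> total occurrences, filled lazily on first lookup
    private final Map<String, Integer> cache = new HashMap<String, Integer>();
    // stand-in for the real scan (assumption: it returns the total count)
    private final ToIntFunction<String> counter;
    // number of lookups that had to fall through to the counter
    private int misses = 0;

    TermCountCache(ToIntFunction<String> counter) {
        this.counter = counter;
    }

    /** Returns the cached count, computing and storing it on first use. */
    int getCount(String term) {
        Integer cached = cache.get(term);
        if (cached == null) {
            misses++;
            cached = counter.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }

    /** How many times the underlying counter was actually invoked. */
    int misses() {
        return misses;
    }
}
```

One caveat with this design: the cached values go stale if the index is updated, so the cache would need to be discarded whenever the reader is reopened.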