Posted to user@nutch.apache.org by Chris K Wensel <ch...@thomson.com> on 2006/09/26 00:02:10 UTC

term frequency

Hi all

I'm interested in playing with term frequency values in a Nutch index, at
both per-document and index-wide scope.

For example, something similar to this Lucene FAQ entry:
http://tinyurl.com/ra3ys

So what is the 'correct' way to inspect the Nutch index for these values,
particularly against the Lucene IndexReader behind the Nutch IndexSearcher?
Since I don't see anything on the Searcher interface, is there some other
Hadoop-ified way to do this?

Assuming there isn't, if I were to add the ability to get document- and
index-wide term frequencies, would this be exposed on the
nutch.searcher.Searcher interface?

e.g. 

Searcher.getTermVector( Hit hit ) // returns a nutch friendly TermVec obj
Searcher.getTermVector( Hit hit, String field )
Searcher.getTermVector( String field )

Or is there a more relevant interface this should hang off of? Searcher
doesn't seem like a fit, and neither does HitDetailer. Maybe HitTermVector
and IndexTermVector?

Or is this just insane: it won't work the way I think, and I should forget
about trying to get corpus-relevant info from the indexes at runtime?

cheers,
ckw


Re: term frequency

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

For some statistical analysis, I also needed term frequencies across the
whole collection. Since Lucene only gives the term frequency per document,
I calculated the collection-wide frequency by summing the term's frequency
over every document that contains it. The code fragment below does this:

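    /**
     * Returns the total number of occurrences of the given term in the index.
     * @param term the term to count
     * @return occurrences of the term, summed over all documents containing it
     * @throws IOException if the index cannot be read
     */
    private int getCount(Term term) throws IOException {
        int count = 0;
        TermDocs termDocs = reader.termDocs(term);
        try {
            while (termDocs.next()) {
                count += termDocs.freq();
            }
        } finally {
            termDocs.close(); // release the enumerator's index resources
        }
        return count;
    }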


However, this method is inefficient, since it recalculates the value
every time it is called, so a caching mechanism will prove useful.
Alternatively, you can build a HashMap up front and store the <term,
frequency> pairs in it.
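
The caching idea could be sketched roughly as follows. This is only an
illustration, not anything that ships with Nutch or Lucene: the
TermCountCache class and its Counter interface are hypothetical names, and
Counter stands in for the expensive lookup (for instance a call to the
getCount(Term) method above) so that the sketch compiles without Lucene on
the classpath.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: memoizes collection-wide term counts. */
final class TermCountCache {

    /** Stand-in for the expensive lookup, e.g. getCount(new Term(field, text)). */
    public interface Counter {
        int count(String field, String text) throws IOException;
    }

    private final Counter counter;
    private final Map<String, Integer> cache = new HashMap<String, Integer>();

    TermCountCache(Counter counter) {
        this.counter = counter;
    }

    /** Returns the cached count, calling the underlying counter on first request only. */
    public synchronized int getCount(String field, String text) throws IOException {
        // Join field and text with a NUL character so distinct pairs get distinct keys.
        String key = field + '\u0000' + text;
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = counter.count(field, text);
            cache.put(key, cached);
        }
        return cached;
    }

    /** Drops all cached counts, e.g. after the index has been reopened. */
    public synchronized void clear() {
        cache.clear();
    }
}
```

The cached values are only valid for the index snapshot they were computed
from, so clear() should be called whenever the reader is reopened. Building
the whole map up front instead would trade startup time and memory for
lookups that never touch the index.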