Posted to user@nutch.apache.org by Erik J <sw...@hotmail.com> on 2006/10/02 09:36:49 UTC

Accessing term frequency with/without Vector?

Hello all,

I would like to be able to get the term frequency for a given term in a 
given document. As far as I can see, TermFrequencyVector is the only way to 
do this, but it seems that this vector is not created by default at the 
indexing phase in Nutch/Lucene. Strangely, I can see the term frequency in 
the Nutch web-gui under "Explain", so the information about term frequency 
is obviously there somewhere. So:

1. Is there a way to access term frequency without the use of 
TermFrequencyVector? I would love something like int tf = 
getTermFrequency(document, field, term)
2. How can I make Nutch/Lucene create the TermVector and does this involve 
any recompiling?

I realize that this might be a trivial question, but any help would be 
greatly appreciated nevertheless...

Best regards,

Erik



Re: Accessing term frequency with/without Vector?

Posted by Erik J <sw...@hotmail.com>.
Hello,

Thanks for the input, I tried the following:

public void handleFreq(String searchstring){
   NutchBean bean = new NutchBean();
   Query query = Query.parse(searchstring);
   Hits hits = bean.search(query, 100);

   for(int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      int docno = hit.getIndexDocNo();
      FSDirectory dir = FSDirectory.getDirectory("C:/nutch-0.7.2/results/crawl/index", false);
      IndexReader reader = IndexReader.open(dir);

      TermDocs td = reader.termDocs(new Term("contents", searchstring));
      int freq=0;
      while (td.next()) {
         System.out.println("Inside while-loop");
         if (td.doc() == docno) {
            freq = td.freq();
            System.out.println("Document no: " + td.doc());
            System.out.println("Term frequency: " + freq);
         }
      }
   }
}

The problem is that "td" seems to be empty, since the while-loop is never 
entered. I suspect it might be the lines starting with "TermDocs td" or 
"FSDirectory" that are causing the problem.

I would very much appreciate any help since I am completely stuck!

Best regards,

Erik



Re: Accessing term frequency with/without Vector?

Posted by José Ramón Pérez Agüera <jo...@fdi.ucm.es>.
Hello,

I think you can use the termDocs method of IndexReader to obtain a TermDocs object, which contains the term frequency for each document.

TermDocs:
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermDocs.html
IndexReader (termDocs):
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#termDocs(org.apache.lucene.index.Term)
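
To make the idea concrete, here is a small self-contained Java sketch of what such a per-term enumeration conceptually holds: for one exact (field, term) pair, a list of (document number, frequency) entries. This is a toy model, not Lucene's API. The class PostingsSketch and its methods addDocument and termFrequency are hypothetical names made up for illustration, and the lowercasing tokenizer is an assumption about how the text was analyzed at indexing time. It does show one thing worth checking when an enumeration comes back empty: the lookup only matches if the field name and the term are exactly what was written to the index (for example a single lowercased token rather than the raw search string).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy model only: hypothetical names, NOT Lucene's or Nutch's API.
class PostingsSketch {

    // One posting: a document number and how often the term occurs in it.
    static final class Posting {
        final int doc;
        final int freq;

        Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    // "field:term" -> postings, kept in increasing document order.
    private final Map<String, List<Posting>> postings = new HashMap<>();
    private int nextDoc = 0;

    // Indexes one field of a new document and returns its document number.
    // Assumption: the analyzer lowercases and splits on non-alphanumerics.
    int addDocument(String field, String text) {
        int doc = nextDoc++;
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            postings.computeIfAbsent(field + ":" + e.getKey(), k -> new ArrayList<>())
                    .add(new Posting(doc, e.getValue()));
        }
        return doc;
    }

    // Walks the postings of one exact (field, term) pair until the wanted
    // document is reached; returns 0 if the term or document is not found.
    int termFrequency(int doc, String field, String term) {
        List<Posting> list = postings.get(field + ":" + term);
        if (list == null) {
            return 0; // unknown field or term: nothing to enumerate
        }
        for (Posting p : list) {
            if (p.doc == doc) {
                return p.freq;
            }
            if (p.doc > doc) {
                break; // postings are ordered, so the document is not listed
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        PostingsSketch index = new PostingsSketch();
        int d0 = index.addDocument("content", "Nutch uses Lucene. Lucene stores term frequencies.");
        System.out.println("tf(lucene) = " + index.termFrequency(d0, "content", "lucene"));
        // A term that differs from the indexed token (here only by case) finds nothing.
        System.out.println("tf(Lucene) = " + index.termFrequency(d0, "content", "Lucene"));
    }
}
```

In the real library the same pattern would be expressed with IndexReader.termDocs(Term) and the doc() and freq() methods of the returned TermDocs, as in the links above. The sketch only illustrates why the Term passed in has to match the indexed field name and token form.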

bye

jose


José Ramón Pérez Agüera

Visiting Research Scholar at Yahoo! Research Spain.
Ocata 1, 1st floor 08003 Barcelona Catalunya, Spain

Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid