Posted to user@nutch.apache.org by Erik J <sw...@hotmail.com> on 2006/10/02 09:36:49 UTC
Accessing term frequency with/without Vector?
Hello all,
I would like to be able to get the term frequency for a given term in a
given document. As far as I can see, TermFrequencyVector is the only way to
do this, but it seems that this vector is not created by default at the
indexing phase in Nutch/Lucene. Strangely, I can see the term frequency in
the Nutch web-gui under "Explain", so the information about term frequency
is obviously there somewhere. So:
1. Is there a way to access term frequency without the use of
TermFrequencyVector? I would love something like int tf =
getTermFrequency(document, field, term)
2. How can I make Nutch/Lucene create the TermVector and does this involve
any recompiling?
I realize that this might be a trivial question, but any help would be
greatly appreciated nevertheless...
Best regards,
Erik
Re: Accessing term frequency with/without Vector?
Posted by Erik J <sw...@hotmail.com>.
Hello,
Thanks for the input, I tried the following:
public void handleFreq(String searchstring){
    NutchBean bean = new NutchBean();
    Query query = Query.parse(searchstring);
    Hits hits = bean.search(query, 100);
    for(int i = 0; i < hits.getLength(); i++) {
        Hit hit = hits.getHit(i);
        int docno = hit.getIndexDocNo();
        FSDirectory dir =
            FSDirectory.getDirectory("C:/nutch-0.7.2/results/crawl/index", false);
        IndexReader reader = IndexReader.open(dir);
        TermDocs td = reader.termDocs(new Term("contents", searchstring));
        int freq=0;
        while (td.next()) {
            System.out.println("Inside while-loop");
            if (td.doc() == docno) {
                freq = td.freq();
                System.out.println("Document no: " + td.doc());
                System.out.println("Term frequency: " + freq);
            }
        }
    }
}
The problem is that "td" seems to be empty, since the while-loop is never
entered. I suspect that the lines starting with "TermDocs td" or
"FSDirectory" are causing the problem.
I would very much appreciate any help since I am completely stuck!
Best regards,
Erik
Re: Accessing term frequency with/without Vector?
Posted by José Ramón Pérez Agüera <jo...@fdi.ucm.es>.
Hello,
I think you can use the termDocs method of IndexReader to obtain a TermDocs object, which contains the term frequency for each document.
TermDocs http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermDocs.html
IndexReader (termDocs)
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#termDocs(org.apache.lucene.index.Term)
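A per-term lookup of this kind only returns entries when the term passed in is identical to a token stored in the index. One possible reason for an empty result (an assumption, not something confirmed in this thread) is that the lookup string contains capital letters or several words, while the index holds single tokens produced by an analyzer. The following is a minimal, library-free sketch of that idea. The class TermFrequencySketch, its methods, and the lowercase-and-split analysis step are all hypothetical stand-ins for illustration; none of this is the Lucene or Nutch API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory stand-in for an inverted index; not Lucene's API. */
final class TermFrequencySketch {
    // term -> (document number -> occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();

    /** Assumed analysis step: lowercase, then split on anything that is not a letter or digit. */
    static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    /** Indexes one document by counting each analyzed token. */
    void addDocument(int docNo, String text) {
        for (String token : analyze(text)) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first element when the text starts with a delimiter
            }
            postings.computeIfAbsent(token, t -> new HashMap<>())
                    .merge(docNo, 1, Integer::sum);
        }
    }

    /** Exact-match lookup: returns 0 unless the term equals an indexed token. */
    int termFrequency(int docNo, String term) {
        Map<Integer, Integer> docs = postings.get(term);
        if (docs == null) {
            return 0; // no posting list for this exact string
        }
        return docs.getOrDefault(docNo, 0);
    }
}
```

In this sketch, looking up a capitalized word or a multi-word string finds nothing, because only single lowercased tokens were indexed. Passing each analyzed token of the search string separately is one way to get matching lookups under these assumptions.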
bye
jose
José Ramón Pérez Agüera
Visiting Research Scholar at Yahoo! Research Spain.
Ocata 1, 1st floor 08003 Barcelona Catalunya, Spain
Dept. de Ingeniería del Software e Inteligencia Artificial
Despacho 411 tlf. 913947599
Facultad de Informática
Universidad Complutense de Madrid