Posted to user@nutch.apache.org by Bruno Patini Furtado <bp...@gmail.com> on 2005/10/31 20:29:31 UTC
Lucene basic Document fields used by Nutch
Hi,
I've looked into org.apache.nutch.indexer.basic.BasicIndexingFilter and saw
that the fields Nutch indexes into the Lucene index are: host, site, url,
content, anchors and title.
Of these, the content field is always used.
But with this simple code using the Lucene API I couldn't retrieve any hits:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.nutch.analysis.NutchDocumentAnalyzer;
public class LuceneTest
{
    public static void main(String[] args) throws Exception
    {
        Searcher searcher = new IndexSearcher("C:/PathNutchCrawlOutput/index");
        Analyzer analyzer = new NutchDocumentAnalyzer();
        String field = "content";
        Query query = QueryParser.parse("a", field, analyzer);
        System.out.println(query);
        System.out.println("Searching for: " + query.toString(field));
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " total matching documents");
    }
}
I receive the exception below, which makes me think I'm using the wrong field
name for the web page content.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:86)
at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
at org.apache.lucene.store.Lock$With.run(Lock.java:109)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
at br.atech.smartsearch.LuceneTest.main(LuceneTest.java:15)
Does anyone know what I am doing wrong?
Just to explain, I'm trying to use the Lucene API directly to access the
Nutch-generated index because I need Lucene's more complete query language
(with parentheses and boolean operators).
Thanks a lot for your time!
--
"Minds are like parachutes, they work best when open."
Bruno Patini Furtado
Software Developer
webpage: www.bpfurtado.net <http://www.bpfurtado.net>
blog: http://www.livejournal.com/users/bpfurtado/