Posted to user@nutch.apache.org by Bruno Patini Furtado <bp...@gmail.com> on 2005/10/31 20:29:31 UTC

Lucene basic Document fields used by Nutch

Hi,
I've looked into org.apache.nutch.indexer.basic.BasicIndexingFilter and saw
that the fields Nutch writes into the Lucene index are: host, site, url,
content, anchors and title.

Of these, the field content is always used.

However, with this simple code using the Lucene API I couldn't retrieve any hits:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.nutch.analysis.NutchDocumentAnalyzer;

public class LuceneTest
{
    public static void main(String[] args) throws Exception
    {
        // Open the index produced by the Nutch crawl.
        Searcher searcher = new IndexSearcher("C:/PathNutchCrawlOutput/index");
        Analyzer analyzer = new NutchDocumentAnalyzer();

        // Parse the query text against the "content" field.
        String field = "content";
        Query query = QueryParser.parse("a", field, analyzer);

        System.out.println(query);
        System.out.println("Searching for: " + query.toString(field));

        // Run the search and report the number of matches.
        Hits hits = searcher.search(query);
        System.out.println(hits.length() + " total matching documents");
    }
}

I receive the exception below, which makes me think I'm using the wrong field
name for the web pages' content.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
 at java.util.ArrayList.get(ArrayList.java:324)
 at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:155)
 at org.apache.lucene.index.FieldInfos.fieldName(FieldInfos.java:151)
 at org.apache.lucene.index.SegmentTermEnum.readTerm(SegmentTermEnum.java:149)
 at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:115)
 at org.apache.lucene.index.TermInfosReader.readIndex(TermInfosReader.java:86)
 at org.apache.lucene.index.TermInfosReader.<init>(TermInfosReader.java:45)
 at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:112)
 at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:89)
 at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:118)
 at org.apache.lucene.store.Lock$With.run(Lock.java:109)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
 at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
 at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:38)
 at br.atech.smartsearch.LuceneTest.main(LuceneTest.java:15)

Does anyone know what I am doing wrong?

Just to explain, I'm trying to use the Lucene API directly to access the
Nutch-generated index because I need Lucene's more complete query language
(with parentheses and boolean operators).


Thanks a lot for your time!


--
"Minds are like parachutes, they work best when open."

Bruno Patini Furtado
Software Developer
webpage: http://www.bpfurtado.net
blog: http://www.livejournal.com/users/bpfurtado/