Posted to user@nutch.apache.org by Katsuki FUJISAWA <ka...@gmail.com> on 2009/09/07 06:13:46 UTC

The index file made by executing main method of org.apache.nutch.crawl.Crawl can not be read from Luke.

Hi,

I am new to Nutch.
I am trying to run a crawl from a Java servlet program without using the
bin/nutch command.
With Nutch 0.9, the index created by the main method of the
org.apache.nutch.crawl.Crawl class can be read by my program.
With Nutch 1.0, however, the index created by the main method of the
org.apache.nutch.crawl.Crawl class cannot be read by my program.


Whether each index can be opened in Luke is shown below.

Index created by Nutch 0.9
  by the bin/nutch command:           readable
  by the main method of Crawl class:  readable

Index created by Nutch 1.0
  by the bin/nutch command:           readable
  by the main method of Crawl class:  unreadable


Does anybody know the reason for this?
Any information would be appreciated.

A sample of my program code is below.

*************************************************************
// Open the existing index; "false" means do not create a new one.
FSDirectory indexDir = FSDirectory.getDirectory( "C:\\nutch-1.0\\crawl\\index", false );
IndexSearcher indexSearcher = new IndexSearcher( indexDir );

List<DisplayBean> displayBeanList = new ArrayList<DisplayBean>();

// Match every document in the index.
Hits hits = indexSearcher.search( new MatchAllDocsQuery() );

// Copy the stored fields of the first three hits into display beans.
Iterator<Hit> i = hits.iterator();
int cnt = 0;
while (i.hasNext() && cnt < 3) {
	Hit hit = i.next();
	DisplayBean displayBean = new DisplayBean();
	displayBean.setUrl(hit.get("url"));
	displayBean.setTitle(hit.get("title"));
	displayBean.setTstamp(hit.get("tstamp"));

	displayBeanList.add(displayBean);

	cnt++;
}

indexSearcher.close();
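
One possible first check, sketched here under assumptions rather than as a known fix: a Lucene 2.x index directory normally contains a segments_N file and a segments.gen file, so listing them shows whether the path passed to FSDirectory really holds an index. The class name IndexDirCheck and the helper segmentsFiles are hypothetical names used only for illustration. The default path is the one from the sample above, and only the JDK is used.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch or Lucene.
public class IndexDirCheck {

    // Returns the names of files in dir that start with "segments",
    // sorted by name. Returns an empty list if dir does not exist
    // or is not a directory.
    static List<String> segmentsFiles(File dir) {
        List<String> found = new ArrayList<String>();
        String[] names = dir.list();
        if (names == null) {
            return found;
        }
        Arrays.sort(names);
        for (String name : names) {
            if (name.startsWith("segments")) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: the index path from the sample above is the default.
        String path = args.length > 0 ? args[0] : "C:\\nutch-1.0\\crawl\\index";
        List<String> found = segmentsFiles(new File(path));
        if (found.isEmpty()) {
            System.out.println("No segments files found in " + path);
        } else {
            System.out.println("Segments files in " + path + ": " + found);
        }
    }
}
```

Running it against the index made by bin/nutch and the index made by the Crawl main method would show whether the two directories differ in this respect. It does not explain why Luke rejects the index.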

Re: The index file made by executing main method of org.apache.nutch.crawl.Crawl can not be read from Luke.

Posted by Katsuki FUJISAWA <ka...@gmail.com>.

My program imports the following Lucene classes.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.store.FSDirectory;

Fujisawa
