Posted to user@nutch.apache.org by "Narayan, Anand" <an...@acs-inc.com> on 2006/09/29 16:22:39 UTC

Crawl on local site not working

I am new to Nutch and am trying to see if we can use it for web search
functionality.
I am running the site on my local box on a WebLogic server.  I am using
Nutch 0.8.1 on Windows XP under Cygwin.

I created a "urls" directory and then created a file called "frontend" in
that directory.
The local URL that I have specified in that file is
http://172.16.10.99:7001/frontend/
This is the only line in that file.
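
For reference, that setup amounts to something like this from the Cygwin
shell (the file name "frontend" is arbitrary; the injector reads every file
in the seed directory):

mkdir urls
echo "http://172.16.10.99:7001/frontend/" > urls/frontend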

I have also changed the crawl-urlfilter file as follows
# accept hosts in MY.DOMAIN.NAME
+^http://172.16.10.99:7001/frontend/
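
For context, in a stock conf/crawl-urlfilter.txt this replaces a default
accept rule that, if I recall it correctly, reads:

+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/

and the last rule in the file (+. to accept or -. to reject) decides what
happens to every URL that no earlier rule matched.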

The command I am executing is 
bin/nutch crawl urls -dir _crawloutput -depth 3 -topN 50
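
For what it is worth, my understanding of the options:

bin/nutch crawl urls -dir _crawloutput -depth 3 -topN 50
#   urls        directory containing the seed list
#   -dir        where the crawldb, segments, linkdb and index are written
#   -depth 3    number of generate/fetch/update rounds
#   -topN 50    maximum number of URLs fetched in each round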

The crawl output I get is as follows:
crawl started in: _crawloutput
rootUrlDir = urls
threads = 10
depth = 3
topN = 50
Injector: starting
Injector: crawlDb: _crawloutput/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101916
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101916
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with:
java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101916
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101924
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101924
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with:
java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101924
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: starting
Generator: segment: _crawloutput/segments/20060929101932
Generator: Selecting best-scoring urls due for fetch.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: _crawloutput/segments/20060929101932
Fetcher: threads: 10
fetching http://172.16.10.99:7001/frontend/
fetch of http://172.16.10.99:7001/frontend/ failed with:
java.lang.NullPointerException
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: _crawloutput/crawldb
CrawlDb update: segment: _crawloutput/segments/20060929101932
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: _crawloutput/linkdb
LinkDb: adding segment: _crawloutput/segments/20060929101916
LinkDb: adding segment: _crawloutput/segments/20060929101924
LinkDb: adding segment: _crawloutput/segments/20060929101932
LinkDb: done
Indexer: starting
Indexer: linkdb: _crawloutput/linkdb
Indexer: adding segment: _crawloutput/segments/20060929101916
Indexer: adding segment: _crawloutput/segments/20060929101924
Indexer: adding segment: _crawloutput/segments/20060929101932
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: _crawloutput/indexes
Dedup: done
Adding _crawloutput/indexes/part-00000
crawl finished: _crawloutput

I am not sure what I am doing wrong. Can someone help?

Thanks
Anand Narayan

Re: Crawl on local site not working

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Anand.


You wrote on 29 September 2006 at 18:22:39:

> I am new to nutch and am trying to see if we can use it for web search
> functionality.
> I am running the site on my local box on a Weblogic server.  I am using
> nutch 0.8.1 on Windows XP using cygwin.

> I created a "urls" directory and then created a file called "frontend" in
> that directory
> The local url that I have specified in that file is
> http://172.16.10.99:7001/frontend/
> This is the only line in that file.

> I have also changed the crawl-urlfilter file as follows
> # accept hosts in MY.DOMAIN.NAME

> +^http://172.16.10.99:7001/frontend/
That line is the problem.
Remove it from the file, and instead copy the URL from your "frontend" seed
file into the crawl-urlfilter file directly after the line
# accept hosts in MY.DOMAIN.NAME

Also remove the "+." rule at the end of the file and write "-." instead.
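
If I am reading that right, the accept section of crawl-urlfilter.txt ends up
roughly like this (my sketch, with the dots escaped as regex literals):

# accept hosts in MY.DOMAIN.NAME
+^http://172\.16\.10\.99:7001/frontend/

# skip everything else
-.

After re-running the crawl, something like
bin/nutch readdb _crawloutput/crawldb -stats
should show whether the page actually moved from unfetched to fetched.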






-- 
Regards,
 Dima                          mailto:nuther@proservice.ge