Posted to agent@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/02/26 05:50:00 UTC

Nutching IRS

Hi,
 
 I am trying to set up Nutch on Tomcat 5.5 / Windows 2000 / jdk1.5.0_04 /
latest Cygwin. I think I am about 99% of the way there, but I have finally
hit a stumbling block. I followed the instructions to a T: deployed the WAR
in the root context, modified the config files, set the NUTCH_JAVA_HOME
environment variable, etc. I have 2 problems.
 
1. The crawl doesn't seem to be working. The crawled dir gets created,
but 0 records are processed (see the log below). My second problem is with
the servlet (see 2. below). Thanks in advance for the help.
crawl-urlfilter.txt
 
 
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://([a-z0-9]*\.)* irs.gov/
-.
 
urls
http://www.irs.gov/
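
One thing that can be checked independently of Nutch is whether the rules above accept the seed URL at all. The sketch below is not Nutch code. It assumes the rules are tried in order, that the first rule whose regex is found anywhere in the URL decides the outcome, and that java.util.regex behaves closely enough to the regex engine Nutch uses. The class UrlFilterSketch and its methods are made up for illustration. Under those assumptions, the + rule as pasted (with a space before irs.gov) never matches http://www.irs.gov/, so the final -. rule rejects the URL. Whether that space is really in the file or was introduced by the mail client is not known.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone check; this is NOT the Nutch RegexURLFilter class.
class UrlFilterSketch {

    // The rules exactly as pasted in this message (wrapped line rejoined).
    static final String[] POSTED_RULES = {
        "-^(file|ftp|mailto):",
        "-\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
        "-[?*!@=]",
        "+^http://([a-z0-9]*\\.)* irs.gov/",
        "-."
    };

    // Same rules, with the space removed from the '+' rule only.
    static String[] withoutSpace() {
        String[] rules = POSTED_RULES.clone();
        rules[3] = rules[3].replace(" ", "");
        return rules;
    }

    // Assumed semantics: the first rule whose regex is found in the URL wins;
    // '+' accepts, '-' rejects, and a URL matching no rule is rejected.
    static boolean accepts(String[] rules, String url) {
        for (String rule : rules) {
            char sign = rule.charAt(0);
            String regex = rule.substring(1);
            if (Pattern.compile(regex).matcher(url).find()) {
                return sign == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String url = "http://www.irs.gov/";
        System.out.println("rules as posted: " + accepts(POSTED_RULES, url));
        System.out.println("space removed:   " + accepts(withoutSpace(), url));
    }
}
```

If the first line prints false and the second prints true, the space in the + rule would be worth ruling out before looking elsewhere for the cause of "Added 0 pages".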
 
Log:
run java in C:\Program Files\Java\jdk1.5.0_04
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-default.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/crawl-tool.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-site.xml
060225 233931 No FS indicated, using default:local
060225 233931 crawl started in: crawled
060225 233931 rootUrlFile = urls
060225 233931 threads = 10
060225 233931 depth = 3
060225 233932 Created webdb at LocalFS,T:\nutch-0.7.1\crawled\db
060225 233932 Starting URL processing
060225 233932 Plugins: looking in: T:\nutch-0.7.1\plugins
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 not including: T:\nutch-0.7.1\plugins\creativecommons
060225 233932 parsing: T:\nutch-0.7.1\plugins\index-basic\plugin.xml
060225 233932 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060225 233932 not including: T:\nutch-0.7.1\plugins\index-more
060225 233932 not including: T:\nutch-0.7.1\plugins\language-identifier
060225 233932 parsing: T:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\ontology
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-ext
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-js
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-rss
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-file
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
060225 233932 parsing: T:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060225 233932 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-httpclient
060225 233932 parsing: T:\nutch-0.7.1\plugins\query-basic\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\query-more
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-site\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-url\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\urlfilter-prefix
060225 233933 parsing: T:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060225 233933 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060225 233933 found resource crawl-urlfilter.txt at file:/T:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060225 233933 Added 0 pages
060225 233933 FetchListTool started
060225 233933 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233933 Overall processing: Sorted NaN entries/second
060225 233933 FetchListTool completed
060225 233933 logging at INFO
060225 233934 Updating T:\nutch-0.7.1\crawled\db
060225 233934 Updating for T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233934 Finishing update
060225 233934 Update finished
060225 233934 FetchListTool started
060225 233935 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233935 Overall processing: Sorted NaN entries/second
060225 233935 FetchListTool completed
060225 233935 logging at INFO
060225 233936 Updating T:\nutch-0.7.1\crawled\db
060225 233936 Updating for T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233936 Finishing update
060225 233936 Update finished
060225 233936 FetchListTool started
060225 233936 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233936 Overall processing: Sorted NaN entries/second
060225 233936 FetchListTool completed
060225 233936 logging at INFO
060225 233937 Updating T:\nutch-0.7.1\crawled\db
060225 233938 Updating for T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Finishing update
060225 233938 Update finished
060225 233938 Updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Sorting pages by url...
060225 233938 Getting updated scores and anchors from db...
060225 233938 Sorting updates by segment...
060225 233938 Updating segments...
060225 233938 Done updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938 * Opening segment 20060225233933
060225 233938 * Indexing segment 20060225233933
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233933: total 0 records in 0.14 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938 * Opening segment 20060225233934
060225 233938 * Indexing segment 20060225233934
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233934: total 0 records in 0.031 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 * Opening segment 20060225233936
060225 233938 * Indexing segment 20060225233936
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233936: total 0 records in 0.032 s (NaN rec/s).
060225 233938 done indexing
060225 233938 Reading url hashes...
060225 233938 Sorting url hashes...
060225 233938 Deleting url duplicates...
060225 233938 Deleted 0 url duplicates.
060225 233938 Reading content hashes...
060225 233938 Sorting content hashes...
060225 233938 Deleting content duplicates...
060225 233938 Deleted 0 content duplicates.
060225 233938 Duplicate deletion complete locally.  Now returning to NFS...
060225 233938 DeleteDuplicates complete
060225 233938 Merging segment indexes... 
060225 233938 crawl finished: crawled

 
 
 
2. Nutch seems to launch fine at http://24.75.221.234:8080/, but when you
search you get the following error. Is this maybe because I haven't
completed a good crawl yet?
 
org.apache.jasper.JasperException
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:370)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause

java.lang.NullPointerException
	org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
	org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
	org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:112)
	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)



Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org/
Free Open Source Tax Software

coming soon: nutch.taxcodesoftware.org

Open directory of tax software development.