Posted to agent@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/02/26 18:21:45 UTC

RE: Nutching IRS: Solved problem with URL file

The urls file needed http://www.irs.gov/index.html;
without the index.html it did not work.
 
I fixed my server problem too! Now IRS has been nutched (to some
extent), and the results can be seen here.
My objectives for using Nutch are to:
1. Hopefully learn something.
2. Create an index of tax-related web pages that are relevant to tax
software developers.
 
I noticed something in the log file about not being able to parse PDF
(is that true?).
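
One way to check this against the crawl log quoted further down is to
scan it for "not including:" lines. That log does list parse-pdf among
the plugins that were not included, which would be consistent with PDFs
not being parsed. The snippet below is only a rough sketch: the sample
lines are copied from the quoted log, and the function and variable
names are made up for illustration.

```python
import re

# Sample lines copied from the crawl log quoted later in this message.
LOG_LINES = [
    r"060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml",
    r"060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword",
    r"060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf",
    r"060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml",
]

# Capture the last path component after "not including:".
SKIPPED = re.compile(r"not including: .*[\\/]([^\\/]+)$")


def skipped_plugins(lines):
    """Return the plugin directory names the log reports as not included."""
    names = []
    for line in lines:
        match = SKIPPED.search(line.strip())
        if match:
            names.append(match.group(1))
    return names


print(skipped_plugins(LOG_LINES))
```

If parse-pdf shows up in that list, the PDF parser was not loaded for
that run.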
 
I can't wait to nutch some more...
 
 

-----Original Message-----
From: Richard Braman [mailto:rbraman@bramantax.com] 
Sent: Saturday, February 25, 2006 11:50 PM
To: 'nutch-agent@lucene.apache.org'
Subject: Nutching IRS


Hi,
 
I am trying to set up Nutch on Tomcat 5.5/Windows 2000/jdk1.5.0_04/latest
Cygwin. I think I am about 99% of the way there, but I finally hit a
stumbling block.  I followed the instructions to a T: set up the war in
the root context, modified the config files, set the NUTCH_JAVA_HOME
environment variable, etc.  I have 2 problems:
 
1. The crawl doesn't seem to be working.  The crawled dir gets created,
but 0 records are processed (see the log below).  My second problem is
with the servlet (see 2. below).  Thanks in advance for the help.
crawl-urlfilter.txt
 
 
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://([a-z0-9]*\.)* irs.gov/
-.
 
urls
http://www.irs.gov/
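
As a quick sanity check of the filter against the seed URL, here is a
rough Python sketch of how a rule list like the one above behaves. It
assumes the first rule whose pattern is found anywhere in the URL
decides the outcome, and that a URL matching no rule is rejected. It is
an illustration only, not Nutch's RegexURLFilter code, and the helper
names are hypothetical. Note that the rules as posted contain a space
before irs.gov; that may just be a mail artifact, so the sketch tries
both variants.

```python
import re


def load_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# The rules exactly as posted, including the space before irs.gov.
AS_POSTED = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
-[?*!@=]
+^http://([a-z0-9]*\.)* irs.gov/
-.
"""
# Same rules with the space removed (assumption: the space is not in the real file).
NO_SPACE = AS_POSTED.replace(")* irs", ")*irs")

for url in ("http://www.irs.gov/", "http://www.irs.gov/index.html"):
    print(url, accepts(load_rules(AS_POSTED), url), accepts(load_rules(NO_SPACE), url))
```

Under these assumptions, the variant with the space rejects both URLs
(they fall through to the final "-." rule), while the variant without
it accepts both.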
 
Log:
run java in C:\Program Files\Java\jdk1.5.0_04
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-default.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/crawl-tool.xml
060225 233931 parsing file:/T:/nutch-0.7.1/conf/nutch-site.xml
060225 233931 No FS indicated, using default:local
060225 233931 crawl started in: crawled
060225 233931 rootUrlFile = urls
060225 233931 threads = 10
060225 233931 depth = 3
060225 233932 Created webdb at LocalFS,T:\nutch-0.7.1\crawled\db
060225 233932 Starting URL processing
060225 233932 Plugins: looking in: T:\nutch-0.7.1\plugins
060225 233932 not including: T:\nutch-0.7.1\plugins\clustering-carrot2
060225 233932 not including: T:\nutch-0.7.1\plugins\creativecommons
060225 233932 parsing: T:\nutch-0.7.1\plugins\index-basic\plugin.xml
060225 233932 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060225 233932 not including: T:\nutch-0.7.1\plugins\index-more
060225 233932 not including: T:\nutch-0.7.1\plugins\language-identifier
060225 233932 parsing: T:\nutch-0.7.1\plugins\nutch-extensionpoints\plugin.xml
060225 233932 not including: T:\nutch-0.7.1\plugins\ontology
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-ext
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-html\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-js
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-msword
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-pdf
060225 233932 not including: T:\nutch-0.7.1\plugins\parse-rss
060225 233932 parsing: T:\nutch-0.7.1\plugins\parse-text\plugin.xml
060225 233932 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-file
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-ftp
060225 233932 parsing: T:\nutch-0.7.1\plugins\protocol-http\plugin.xml
060225 233932 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060225 233932 not including: T:\nutch-0.7.1\plugins\protocol-httpclient
060225 233932 parsing: T:\nutch-0.7.1\plugins\query-basic\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\query-more
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-site\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060225 233933 parsing: T:\nutch-0.7.1\plugins\query-url\plugin.xml
060225 233933 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060225 233933 not including: T:\nutch-0.7.1\plugins\urlfilter-prefix
060225 233933 parsing: T:\nutch-0.7.1\plugins\urlfilter-regex\plugin.xml
060225 233933 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060225 233933 found resource crawl-urlfilter.txt at file:/T:/nutch-0.7.1/conf/crawl-urlfilter.txt
.060225 233933 Added 0 pages
060225 233933 FetchListTool started
060225 233933 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233933 Overall processing: Sorted NaN entries/second
060225 233933 FetchListTool completed
060225 233933 logging at INFO
060225 233934 Updating T:\nutch-0.7.1\crawled\db
060225 233934 Updating for T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233934 Finishing update
060225 233934 Update finished
060225 233934 FetchListTool started
060225 233935 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233935 Overall processing: Sorted NaN entries/second
060225 233935 FetchListTool completed
060225 233935 logging at INFO
060225 233936 Updating T:\nutch-0.7.1\crawled\db
060225 233936 Updating for T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233936 Finishing update
060225 233936 Update finished
060225 233936 FetchListTool started
060225 233936 Overall processing: Sorted 0 entries in 0.0 seconds.
060225 233936 Overall processing: Sorted NaN entries/second
060225 233936 FetchListTool completed
060225 233936 logging at INFO
060225 233937 Updating T:\nutch-0.7.1\crawled\db
060225 233938 Updating for T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Finishing update
060225 233938 Update finished
060225 233938 Updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938  reading T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 Sorting pages by url...
060225 233938 Getting updated scores and anchors from db...
060225 233938 Sorting updates by segment...
060225 233938 Updating segments...
060225 233938 Done updating T:\nutch-0.7.1\crawled\segments from T:\nutch-0.7.1\crawled\db
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233933
060225 233938 * Opening segment 20060225233933
060225 233938 * Indexing segment 20060225233933
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233933: total 0 records in 0.14 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233934
060225 233938 * Opening segment 20060225233934
060225 233938 * Indexing segment 20060225233934
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233934: total 0 records in 0.031 s (NaN rec/s).
060225 233938 done indexing
060225 233938 indexing segment: T:\nutch-0.7.1\crawled\segments\20060225233936
060225 233938 * Opening segment 20060225233936
060225 233938 * Indexing segment 20060225233936
060225 233938 * Optimizing index...
060225 233938 * Moving index to NFS if needed...
060225 233938 DONE indexing segment 20060225233936: total 0 records in 0.032 s (NaN rec/s).
060225 233938 done indexing
060225 233938 Reading url hashes...
060225 233938 Sorting url hashes...
060225 233938 Deleting url duplicates...
060225 233938 Deleted 0 url duplicates.
060225 233938 Reading content hashes...
060225 233938 Sorting content hashes...
060225 233938 Deleting content duplicates...
060225 233938 Deleted 0 content duplicates.
060225 233938 Duplicate deletion complete locally.  Now returning to NFS...
060225 233938 DeleteDuplicates complete
060225 233938 Merging segment indexes... 
060225 233938 crawl finished: crawled

 
 
 
2. Nutch seems to launch fine at http://24.75.221.234:8080/. When you
search, you get the following error.  Is this maybe because I haven't
completed a good crawl yet?
 

org.apache.jasper.JasperException
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:370)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)

root cause

java.lang.NullPointerException
	org.apache.nutch.searcher.NutchBean.init(NutchBean.java:96)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:82)
	org.apache.nutch.searcher.NutchBean.<init>(NutchBean.java:72)
	org.apache.nutch.searcher.NutchBean.get(NutchBean.java:64)
	org.apache.jsp.search_jsp._jspService(org.apache.jsp.search_jsp:112)
	org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
	org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:322)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:291)
	org.apache.jasper.servlet.JspServlet.service(JspServlet.java:241)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:802)



Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org/
Free Open Source Tax Software

coming soon: nutch.taxcodesoftware.org

Open directory of tax software development.