Posted to user@nutch.apache.org by Julius Schorzman <ju...@gmail.com> on 2006/07/13 21:19:01 UTC

Added 0 pages

I'm having trouble figuring out why I keep getting "Added 0 pages" when
running the crawl with Nutch.  I've searched the site and can't find an
answer as to what might be going wrong.  I'm running this on Windows using
Eclipse because I may have to change the code slightly.  I've already made a
few modifications so that the path of the config files is specified
explicitly, but I don't think that would be related to this issue.  Any help
is greatly appreciated!

crawl-root-urls.txt:
http://www.apache.com

crawl-urlfilter.txt:
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.com/

# skip everything else
-.

Log:
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\nutch-default.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\crawl-tool.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\nutch-site.xml
060713 150946 No FS indicated, using default:local
060713 150946 crawl started in: crawl-20060713150946
060713 150946 rootUrlFile = crawl-root-urls.txt
060713 150946 threads = 10
060713 150946 depth = 5
060713 150947 Created webdb at LocalFS,C:\Documents and
Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150947 Starting URL processing
060713 150947 Plugins: looking in: C:\Nutch\WEB-INF\plugins
060713 150947 not including: C:\Nutch\WEB-INF\plugins\clustering-carrot2
060713 150947 not including: C:\Nutch\WEB-INF\plugins\creativecommons
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\index-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.indexer.IndexingFilter class=
org.apache.nutch.indexer.basic.BasicIndexingFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\index-more
060713 150947 not including: C:\Nutch\WEB-INF\plugins\language-identifier
060713 150947 parsing:
C:\Nutch\WEB-INF\plugins\nutch-extensionpoints\plugin.xml
060713 150947 not including: C:\Nutch\WEB-INF\plugins\ontology
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-ext
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-html\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class=
org.apache.nutch.parse.html.HtmlParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-js
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-msword
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-pdf
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-rss
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-text\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class=
org.apache.nutch.parse.text.TextParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-file
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-ftp
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\protocol-http\plugin.xml
060713 150947 impl: point=org.apache.nutch.protocol.Protocol class=
org.apache.nutch.protocol.http.Http
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-httpclient
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.basic.BasicQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\query-more
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-site\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.site.SiteQueryFilter
.060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-url\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.url.URLQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\urlfilter-prefix
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\urlfilter-regex\plugin.xml
060713 150947 impl: point=org.apache.nutch.net.URLFilter class=
org.apache.nutch.net.RegexURLFilter
060713 150947 found resource crawl-urlfilter.txt at
file:/C:/Documents%20and%20Settings/jschorzman/My%20Documents/My%20Workspace/Nutch/WEB-INF/conf/crawl-urlfilter.txt
060713 150947 Added 0 pages
060713 150947 FetchListTool started
060713 150947 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150947 Overall processing: Sorted NaN entries/second
060713 150947 FetchListTool completed
060713 150947 logging at INFO
060713 150948 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150948 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150948 Finishing update
060713 150948 Update finished
060713 150948 FetchListTool started
060713 150948 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150948 Overall processing: Sorted NaN entries/second
060713 150949 FetchListTool completed
060713 150949 logging at INFO
060713 150950 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150950 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150950 Finishing update
060713 150950 Update finished
060713 150950 FetchListTool started
060713 150950 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150950 Overall processing: Sorted NaN entries/second
060713 150950 FetchListTool completed
060713 150950 logging at INFO
060713 150951 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150951 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150951 Finishing update
060713 150951 Update finished
060713 150951 FetchListTool started
060713 150951 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150951 Overall processing: Sorted NaN entries/second
060713 150951 FetchListTool completed
060713 150951 logging at INFO
060713 150952 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150952 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150952 Finishing update
060713 150952 Update finished
060713 150952 FetchListTool started
060713 150953 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150953 Overall processing: Sorted NaN entries/second
060713 150953 FetchListTool completed
060713 150953 logging at INFO
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150954 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Finishing update
060713 150954 Update finished
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments from C:\Documents and
Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Sorting pages by url...
060713 150954 Getting updated scores and anchors from db...
060713 150954 Sorting updates by segment...
060713 150954 Updating segments...
060713 150954 Done updating C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments from C:\Documents
and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954 * Opening segment 20060713150947
060713 150954 * Indexing segment 20060713150947
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150947: total 0 records in
0.02s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954 * Opening segment 20060713150948
060713 150954 * Indexing segment 20060713150948
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150948: total 0 records in
0.021s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954 * Opening segment 20060713150950
060713 150954 * Indexing segment 20060713150950
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150950: total 0 records in
0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954 * Opening segment 20060713150951
060713 150954 * Indexing segment 20060713150951
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150951: total 0 records in
0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 * Opening segment 20060713150952
060713 150954 * Indexing segment 20060713150952
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150952: total 0 records in
0.06s (NaN rec/s).
060713 150954 done indexing
060713 150954 Reading url hashes...
060713 150954 Sorting url hashes...
060713 150954 Deleting url duplicates...
060713 150954 Deleted 0 url duplicates.
060713 150954 Reading content hashes...
060713 150954 Sorting content hashes...
060713 150954 Deleting content duplicates...
060713 150954 Deleted 0 content duplicates.
060713 150954 Duplicate deletion complete locally.  Now returning to NFS...
060713 150954 DeleteDuplicates complete
060713 150954 Merging segment indexes...
060713 150954 crawl finished: crawl-20060713150946

Re: Added 0 pages

Posted by Julius Schorzman <ju...@gmail.com>.
**If you received my message more than once, I deeply apologize.  I thought
my message was rejected because I didn't receive it myself.**

Thanks for the follow-up!  I switched to the configuration from the tutorial
and it began working for Nutch's web page.  I'll take another look at the
documentation for the filter.

On 7/13/06, Karsten Dello <de...@mi.fu-berlin.de> wrote:
>
> Hi,
>
> in my opinion
>
> Julius Schorzman wrote:
> > http://www.apache.com
>
> is not matched by the regex
>
> > +^http://([a-z0-9]*\.)*apache.com/
>
> as it does not end with a trailing slash.
>
> Cheers
> Karsten
>

Re: Added 0 pages

Posted by Karsten Dello <de...@mi.fu-berlin.de>.
Hi,

in my opinion

Julius Schorzman wrote:
> http://www.apache.com

is not matched by the regex

> +^http://([a-z0-9]*\.)*apache.com/

as it does not end with a trailing slash.

Cheers
Karsten
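
The mismatch Karsten describes can be reproduced outside Nutch. The sketch
below is not Nutch's RegexURLFilter; it is a small Python imitation of the
"first matching pattern decides" rule stated in the comments of
crawl-urlfilter.txt, using the rules from the original post. The helper name
`accepts` is hypothetical, and it assumes patterns may match anywhere in the
URL unless anchored and that the seed URL is tested exactly as written in
crawl-root-urls.txt.

```python
import re

# Rules copied from the crawl-urlfilter.txt in the original post.
# The first character is the sign; the rest is the regular expression.
RULES = [
    r"-^(file|ftp|mailto):",
    r"-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$",
    r"-[?*!@=]",
    r"+^http://([a-z0-9]*\.)*apache.com/",
    r"-.",
]


def accepts(url, rules=RULES):
    """Hypothetical helper: the first rule whose pattern matches decides.

    Returns True for a '+' rule, False for a '-' rule, and False when no
    rule matches (the filter file says unmatched URLs are ignored).
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The seed URL as posted has no trailing slash, so the '+' rule does not
# match and the catch-all '-.' rule rejects it.
print(accepts("http://www.apache.com"))   # False
# With the trailing slash the '+' rule matches first.
print(accepts("http://www.apache.com/"))  # True
```

Under these assumptions, either adding the trailing slash to the seed URL or
dropping the final "/" from the '+' pattern would let the seed through. Note
also that the "." in "apache.com" is unescaped and therefore matches any
character; "apache\.com" would be stricter.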

Added 0 pages

Posted by Julius Schorzman <ju...@gmail.com>.
I'm having trouble figuring out why I keep getting "Added 0 pages" when
running the crawl with nutch.  I've searched the site and can't find an
answer to as what might be going wrong.  I'm running this on windows using
eclipse because I may have to change the code slightly.  I've already made a
few modifications so that the path of the config files is specified
explicitly, but I don't think that would be related to this issue.  Any help
is greatly appreciated!

crawl-root-urls.txt:
http://www.apache.com

crawl-urlfilter.txt :
# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$


# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*apache.com/

# skip everything else
-.

Log:
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\nutch-default.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\crawl- tool.xml
060713 150946 parsing C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\WEB-INF\conf\nutch-site.xml
060713 150946 No FS indicated, using default:local
060713 150946 crawl started in: crawl-20060713150946
060713 150946 rootUrlFile = crawl-root-urls.txt
060713 150946 threads = 10
060713 150946 depth = 5
060713 150947 Created webdb at LocalFS,C:\Documents and
Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150947 Starting URL processing
060713 150947 Plugins: looking in: C:\Nutch\WEB-INF\plugins
060713 150947 not including: C:\Nutch\WEB-INF\plugins\clustering-carrot2
060713 150947 not including: C:\Nutch\WEB-INF\plugins\creativecommons
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\index-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.indexer.IndexingFilter class=
org.apache.nutch.indexer.basic.BasicIndexingFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\index-more
060713 150947 not including: C:\Nutch\WEB-INF\plugins\language-identifier
060713 150947 parsing:
C:\Nutch\WEB-INF\plugins\nutch-extensionpoints\plugin.xml
060713 150947 not including: C:\Nutch\WEB-INF\plugins\ontology
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-ext
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-html\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class=
org.apache.nutch.parse.html.HtmlParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-js
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-msword
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-pdf
060713 150947 not including: C:\Nutch\WEB-INF\plugins\parse-rss
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\parse-text\plugin.xml
060713 150947 impl: point=org.apache.nutch.parse.Parser class=
org.apache.nutch.parse.text.TextParser
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-file
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-ftp
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\protocol-http\plugin.xml
060713 150947 impl: point=org.apache.nutch.protocol.Protocol class=
org.apache.nutch.protocol.http.Http
060713 150947 not including: C:\Nutch\WEB-INF\plugins\protocol-httpclient
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-basic\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.basic.BasicQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\query-more
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-site\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.site.SiteQueryFilter
.060713 150947 parsing: C:\Nutch\WEB-INF\plugins\query-url\plugin.xml
060713 150947 impl: point=org.apache.nutch.searcher.QueryFilter class=
org.apache.nutch.searcher.url.URLQueryFilter
060713 150947 not including: C:\Nutch\WEB-INF\plugins\urlfilter-prefix
060713 150947 parsing: C:\Nutch\WEB-INF\plugins\urlfilter-regex\plugin.xml
060713 150947 impl: point=org.apache.nutch.net.URLFilter class=
org.apache.nutch.net.RegexURLFilter
060713 150947 found resource crawl-urlfilter.txt at
file:/C:/Documents%20and%20Settings/jschorzman/My%20Documents/My%20Workspace/Nutch/WEB-INF/conf/crawl-
urlfilter.txt
060713 150947 Added 0 pages
060713 150947 FetchListTool started
060713 150947 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150947 Overall processing: Sorted NaN entries/second
060713 150947 FetchListTool completed
060713 150947 logging at INFO
060713 150948 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150948 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150948 Finishing update
060713 150948 Update finished
060713 150948 FetchListTool started
060713 150948 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150948 Overall processing: Sorted NaN entries/second
060713 150949 FetchListTool completed
060713 150949 logging at INFO
060713 150950 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150950 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150950 Finishing update
060713 150950 Update finished
060713 150950 FetchListTool started
060713 150950 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150950 Overall processing: Sorted NaN entries/second
060713 150950 FetchListTool completed
060713 150950 logging at INFO
060713 150951 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150951 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150951 Finishing update
060713 150951 Update finished
060713 150951 FetchListTool started
060713 150951 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150951 Overall processing: Sorted NaN entries/second
060713 150951 FetchListTool completed
060713 150951 logging at INFO
060713 150952 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150952 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150952 Finishing update
060713 150952 Update finished
060713 150952 FetchListTool started
060713 150953 Overall processing: Sorted 0 entries in 0.0 seconds.
060713 150953 Overall processing: Sorted NaN entries/second
060713 150953 FetchListTool completed
060713 150953 logging at INFO
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My
Workspace\Nutch\crawl-20060713150946\db
060713 150954 Updating for C:\Documents and Settings\jschorzman\My
Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Finishing update
060713 150954 Update finished
060713 150954 Updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments from C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954  reading C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 Sorting pages by url...
060713 150954 Getting updated scores and anchors from db...
060713 150954 Sorting updates by segment...
060713 150954 Updating segments...
060713 150954 Done updating C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments from C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\db
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150947
060713 150954 * Opening segment 20060713150947
060713 150954 * Indexing segment 20060713150947
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150947: total 0 records in 0.02s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150948
060713 150954 * Opening segment 20060713150948
060713 150954 * Indexing segment 20060713150948
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150948: total 0 records in 0.021s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150950
060713 150954 * Opening segment 20060713150950
060713 150954 * Indexing segment 20060713150950
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150950: total 0 records in 0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150951
060713 150954 * Opening segment 20060713150951
060713 150954 * Indexing segment 20060713150951
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150951: total 0 records in 0.01s (NaN rec/s).
060713 150954 done indexing
060713 150954 indexing segment: C:\Documents and Settings\jschorzman\My Documents\My Workspace\Nutch\crawl-20060713150946\segments\20060713150952
060713 150954 * Opening segment 20060713150952
060713 150954 * Indexing segment 20060713150952
060713 150954 * Optimizing index...
060713 150954 * Moving index to NFS if needed...
060713 150954 DONE indexing segment 20060713150952: total 0 records in 0.06s (NaN rec/s).
060713 150954 done indexing
060713 150954 Reading url hashes...
060713 150954 Sorting url hashes...
060713 150954 Deleting url duplicates...
060713 150954 Deleted 0 url duplicates.
060713 150954 Reading content hashes...
060713 150954 Sorting content hashes...
060713 150954 Deleting content duplicates...
060713 150954 Deleted 0 content duplicates.
060713 150954 Duplicate deletion complete locally.  Now returning to NFS...
060713 150954 DeleteDuplicates complete
060713 150954 Merging segment indexes...
060713 150954 crawl finished: crawl-20060713150946
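
Since every FetchListTool pass above sorts 0 entries, one thing that may be worth ruling out is the seed URL being rejected by the accept rule before it ever reaches the db. Below is a minimal standalone sketch, not Nutch code: the class name SeedFilterCheck and the method accepted are made up for illustration, and it assumes the filter treats a rule as matching when the regular expression is found anywhere in the URL and that the seed is tested exactly as written in crawl-root-urls.txt (no trailing slash added first). Both assumptions would need to be confirmed against the Nutch source.

```java
import java.util.regex.Pattern;

public class SeedFilterCheck {
    // Accept rule copied from crawl-urlfilter.txt, without the leading '+'.
    static final Pattern ACCEPT =
        Pattern.compile("^http://([a-z0-9]*\\.)*apache.com/");

    // Assumption: a rule applies when the pattern is found anywhere in the URL.
    static boolean accepted(String url) {
        return ACCEPT.matcher(url).find();
    }

    public static void main(String[] args) {
        // Seed exactly as listed in crawl-root-urls.txt (no trailing slash).
        System.out.println(accepted("http://www.apache.com"));   // prints false
        // Same seed with a trailing slash.
        System.out.println(accepted("http://www.apache.com/"));  // prints true
    }
}
```

If those assumptions hold, the seed as written would fall through to the final "-." rule, so adding a trailing slash to the seed (or dropping the trailing "/" from the accept pattern) might be enough to get pages added.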