Posted to user@nutch.apache.org by kevin pang <ke...@gmail.com> on 2006/07/07 04:12:33 UTC

why i can't crawl all the linked pages in the specified page to crawl.

I set up Nutch to crawl the URL http://www.haha365.com/gd_joke/,
but after the crawl completed, only 54 pages had been fetched.

Here is the log output:

060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-default.xml
060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/crawl-tool.xml
060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-site.xml
060705 154332 No FS indicated, using default:local
060705 154332 crawl started in: crawled2
060705 154332 rootUrlFile = url.txt
060705 154332 threads = 4
060705 154332 depth = 3
060705 154333 Created webdb at LocalFS,C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154333 Starting URL processing
060705 154333 Plugins: looking in: C:\cygwin\nutch-0.7.2\plugins
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
060705 154333 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\urlfilter-prefix
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-url\plugin.xml
060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-site\plugin.xml
060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\query-more
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-basic\plugin.xml
060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-httpclient
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\protocol-http\plugin.xml
060705 154333 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-ftp
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-file
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-text\plugin.xml
060705 154333 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-rss
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-pdf
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-msword
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-js
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-html\plugin.xml
060705 154333 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-ext
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\ontology
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\language-identifier
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\index-more
060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\index-basic\plugin.xml
060705 154333 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\creativecommons
060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\clustering-carrot2
060705 154333 found resource crawl-urlfilter.txt at file:/C:/cygwin/nutch-0.7.2/conf/crawl-urlfilter.txt
060705 154333 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060705 154333 Added 1 pages
060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.016 seconds.
060705 154333 Processing pagesByURL: Sorted 62.5 instructions/second
060705 154333 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
060705 154333 Processing pagesByURL: Merged Infinity records/second
060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
060705 154333 Processing pagesByMD5: Sorted Infinity instructions/second
060705 154333 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
060705 154333 Processing pagesByMD5: Merged Infinity records/second
060705 154333 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
060705 154333 Processing linksByURL: Copied file (0 bytes) in 0.015 secs.
060705 154333 FetchListTool started
060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.0 seconds.
060705 154333 Processing pagesByURL: Sorted Infinity instructions/second
060705 154333 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
060705 154333 Processing pagesByURL: Merged Infinity records/second
060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
060705 154333 Processing pagesByMD5: Sorted Infinity instructions/second
060705 154334 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
060705 154334 Processing pagesByMD5: Merged Infinity records/second
060705 154334 Processing linksByMD5: Copied file (0 bytes) in 0.031 secs.
060705 154334 Processing linksByURL: Copied file (0 bytes) in 0.015 secs.
060705 154334 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted 1 entries in 0.015 seconds.
060705 154334 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted 66.66666666666667 entries/second
060705 154334 Overall processing: Sorted 1 entries in 0.015 seconds.
060705 154334 Overall processing: Sorted 0.015 entries/second
060705 154334 FetchListTool completed
060705 154334 logging at INFO
060705 154334 fetching http://www.haha365.com/gd_joke/index_3.htm
060705 154334 http.proxy.host = null
060705 154334 http.proxy.port = 8080
060705 154334 http.timeout = 10000
060705 154334 http.content.limit = 65536
060705 154334 http.agent = NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
060705 154334 fetcher.server.delay = 1000
060705 154334 http.max.delays = 100
060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172 bytes, 2000 ms
060705 154336 status: 0.5 pages/s, 74.890625 kb/s, 19172.0 bytes/page
060705 154337 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154337 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
060705 154337 Processing document 0
060705 154337 Finishing update
060705 154337 Processing pagesByURL: Sorted 27 instructions in 0.015 seconds.
060705 154337 Processing pagesByURL: Sorted 1800.0 instructions/second
060705 154337 Processing pagesByURL: Merged to new DB containing 27 records in 0.0 seconds
060705 154337 Processing pagesByURL: Merged Infinity records/second
060705 154337 Processing pagesByMD5: Sorted 28 instructions in 0.015 seconds.
060705 154337 Processing pagesByMD5: Sorted 1866.6666666666667 instructions/second
060705 154337 Processing pagesByMD5: Merged to new DB containing 27 records in 0.016 seconds
060705 154337 Processing pagesByMD5: Merged 1687.5 records/second
060705 154337 Processing linksByMD5: Sorted 27 instructions in 0.015 seconds.
060705 154337 Processing linksByMD5: Sorted 1800.0 instructions/second
060705 154337 Processing linksByMD5: Merged to new DB containing 26 records in 0.0 seconds
060705 154337 Processing linksByMD5: Merged Infinity records/second
060705 154337 Processing linksByURL: Sorted 26 instructions in 0.015 seconds.
060705 154337 Processing linksByURL: Sorted 1733.3333333333335 instructions/second
060705 154337 Processing linksByURL: Merged to new DB containing 26 records in 0.0 seconds
060705 154337 Processing linksByURL: Merged Infinity records/second
060705 154337 Processing linksByMD5: Sorted 26 instructions in 0.031 seconds.
060705 154337 Processing linksByMD5: Sorted 838.7096774193549 instructions/second
060705 154337 Processing linksByMD5: Merged to new DB containing 26 records in 0.0 seconds
060705 154337 Processing linksByMD5: Merged Infinity records/second
060705 154337 Update finished
060705 154337 FetchListTool started
060705 154338 Processing pagesByURL: Sorted 26 instructions in 0.016 seconds.
060705 154338 Processing pagesByURL: Sorted 1625.0 instructions/second
060705 154338 Processing pagesByURL: Merged to new DB containing 27 records in 0.0 seconds
060705 154338 Processing pagesByURL: Merged Infinity records/second
060705 154338 Processing pagesByMD5: Sorted 26 instructions in 0.0 seconds.
060705 154338 Processing pagesByMD5: Sorted Infinity instructions/second
060705 154338 Processing pagesByMD5: Merged to new DB containing 27 records in 0.015 seconds
060705 154338 Processing pagesByMD5: Merged 1800.0 records/second
060705 154338 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
060705 154338 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
060705 154338 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted 26 entries in 0.0 seconds.
060705 154338 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted Infinity entries/second
060705 154338 Overall processing: Sorted 26 entries in 0.0 seconds.
060705 154338 Overall processing: Sorted 0.0 entries/second
060705 154338 FetchListTool completed
060705 154338 logging at INFO
060705 154338 fetching http://www.haha365.com/gd_joke/20050319084431.htm
060705 154338 fetching http://www.haha365.com/gd_joke/20050319084733.htm
060705 154338 fetching http://www.haha365.com/gd_joke/20050319085110.htm
060705 154338 fetching http://www.haha365.com/gd_joke/20050319084338.htm
060705 154339 fetching http://www.haha365.com/gd_joke/20050319085226.htm
060705 154340 fetching http://www.haha365.com/gd_joke/20050318163740.htm
060705 154341 fetching http://www.haha365.com/gd_joke/20050319085344.htm
060705 154343 fetching http://www.haha365.com/gd_joke/20050318163709.htm
060705 154345 fetching http://www.haha365.com/gd_joke/20050319085310.htm
060705 154347 fetching http://www.haha365.com/gd_joke/20050319085028.htm
060705 154349 fetching http://www.haha365.com/gd_joke/20050319084052.htm
060705 154350 fetching http://www.haha365.com/gd_joke/index.htm
060705 154352 fetching http://www.haha365.com/gd_joke/20050319084902.htm
060705 154353 fetching http://www.haha365.com/gd_joke/20050319084945.htm
060705 154355 fetching http://www.haha365.com/gd_joke/20050319084129.htm
060705 154356 fetching http://www.haha365.com/gd_joke/20050319084202.htm
060705 154358 fetching http://www.haha365.com/gd_joke/20050318163642.htm
060705 154359 fetching http://www.haha365.com/gd_joke/20050319084304.htm
060705 154400 fetching http://www.haha365.com/gd_joke/20050319084822.htm
060705 154402 fetching http://www.haha365.com/gd_joke/20050319085142.htm
060705 154403 fetching http://www.haha365.com/gd_joke/20050319084232.htm
060705 154408 fetching http://www.haha365.com/gd_joke/20050318163829.htm
060705 154411 fetching http://www.haha365.com/gd_joke/20050318163920.htm
060705 154415 fetching http://www.haha365.com/gd_joke/20050319084559.htm
060705 154419 fetching http://www.haha365.com/gd_joke/
060705 154423 fetching http://www.haha365.com/gd_joke/20050318163807.htm
060705 154440 status: segment 20060705154337, 26 pages, 0 errors, 323050 bytes, 62047 ms
060705 154440 status: 0.41903716 pages/s, 40.67607 kb/s, 12425.0 bytes/page
060705 154441 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154441 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
060705 154441 Processing document 0
060705 154441 Finishing update
060705 154441 Processing pagesByURL: Sorted 174 instructions in 0.016 seconds.
060705 154441 Processing pagesByURL: Sorted 10875.0 instructions/second
060705 154441 Processing pagesByURL: Merged to new DB containing 53 records in 0.0 seconds
060705 154441 Processing pagesByURL: Merged Infinity records/second
060705 154441 Processing pagesByMD5: Sorted 78 instructions in 0.015 seconds.
060705 154441 Processing pagesByMD5: Sorted 5200.0 instructions/second
060705 154441 Processing pagesByMD5: Merged to new DB containing 53 records in 0.0 seconds
060705 154441 Processing pagesByMD5: Merged Infinity records/second
060705 154441 Processing linksByMD5: Sorted 174 instructions in 0.016 seconds.
060705 154441 Processing linksByMD5: Sorted 10875.0 instructions/second
060705 154441 Processing linksByMD5: Merged to new DB containing 148 records in 0.015 seconds
060705 154441 Processing linksByMD5: Merged 9866.666666666668 records/second
060705 154441 Processing linksByURL: Sorted 122 instructions in 0.0 seconds.
060705 154441 Processing linksByURL: Sorted Infinity instructions/second
060705 154441 Processing linksByURL: Merged to new DB containing 148 records in 0.015 seconds
060705 154441 Processing linksByURL: Merged 9866.666666666668 records/second
060705 154441 Processing linksByMD5: Sorted 148 instructions in 0.0 seconds.
060705 154441 Processing linksByMD5: Sorted Infinity instructions/second
060705 154441 Processing linksByMD5: Merged to new DB containing 148 records in 0.016 seconds
060705 154441 Processing linksByMD5: Merged 9250.0 records/second
060705 154442 Update finished
060705 154442 FetchListTool started
060705 154442 Processing pagesByURL: Sorted 26 instructions in 0.016 seconds.
060705 154442 Processing pagesByURL: Sorted 1625.0 instructions/second
060705 154442 Processing pagesByURL: Merged to new DB containing 53 records in 0.015 seconds
060705 154442 Processing pagesByURL: Merged 3533.3333333333335 records/second
060705 154442 Processing pagesByMD5: Sorted 26 instructions in 0.0 seconds.
060705 154442 Processing pagesByMD5: Sorted Infinity instructions/second
060705 154442 Processing pagesByMD5: Merged to new DB containing 53 records in 0.0 seconds
060705 154442 Processing pagesByMD5: Merged Infinity records/second
060705 154442 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
060705 154442 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
060705 154442 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted 26 entries in 0.093 seconds.
060705 154442 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted 279.5698924731183 entries/second
060705 154442 Overall processing: Sorted 26 entries in 0.093 seconds.
060705 154442 Overall processing: Sorted 0.003576923076923077 entries/second
060705 154443 FetchListTool completed
060705 154443 logging at INFO
060705 154443 fetching http://www.haha365.com/gd_joke/20050815111532.htm
060705 154443 fetching http://www.haha365.com/gd_joke/20050815105800.htm
060705 154443 fetching http://www.haha365.com/gd_joke/20050319085605.htm
060705 154443 fetching http://www.haha365.com/gd_joke/20050815110121.htm
060705 154446 fetching http://www.haha365.com/gd_joke/20060625064748.htm
060705 154448 fetching http://www.haha365.com/gd_joke/20050815105937.htm
060705 154449 fetching http://www.haha365.com/gd_joke/20050815110925.htm
060705 154450 fetching http://www.haha365.com/gd_joke/20050815111651.htm
060705 154452 fetching http://www.haha365.com/gd_joke/20050706110014.htm
060705 154453 fetching http://www.haha365.com/gd_joke/20050318163615.htm
060705 154454 fetching http://www.haha365.com/gd_joke/20050815111228.htm
060705 154456 fetching http://www.haha365.com/gd_joke/20050706105833.htm
060705 154457 fetching http://www.haha365.com/gd_joke/20050815110411.htm
060705 154459 fetching http://www.haha365.com/gd_joke/20050815105527.htm
060705 154500 fetching http://www.haha365.com/gd_joke/20050815111758.htm
060705 154502 fetching http://www.haha365.com/gd_joke/20050706110230.htm
060705 154503 fetching http://www.haha365.com/gd_joke/20050706105453.htm
060705 154504 fetching http://www.haha365.com/gd_joke/20050706110522.htm
060705 154506 fetching http://www.haha365.com/gd_joke/20050706105104.htm
060705 154507 fetching http://www.haha365.com/gd_joke/20050709144044.htm
060705 154509 fetching http://www.haha365.com/gd_joke/20060611112617.htm
060705 154510 fetching http://www.haha365.com/gd_joke/20050815105330.htm
060705 154511 fetching http://www.haha365.com/gd_joke/20050709144708.htm
060705 154513 fetching http://www.haha365.com/gd_joke/20050706105324.htm
060705 154514 fetching http://www.haha365.com/gd_joke/20050815110707.htm
060705 154516 fetching http://www.haha365.com/gd_joke/20050706105218.htm
060705 154523 status: segment 20060705154442, 26 pages, 0 errors, 314308 bytes, 40063 ms
060705 154523 status: 0.6489779 pages/s, 61.291748 kb/s, 12088.77 bytes/page
060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154524 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
060705 154524 Processing document 0
060705 154524 Finishing update
060705 154524 Processing pagesByURL: Sorted 127 instructions in 0.0 seconds.
060705 154524 Processing pagesByURL: Sorted Infinity instructions/second
060705 154524 Processing pagesByURL: Merged to new DB containing 56 records in 0.0 seconds
060705 154524 Processing pagesByURL: Merged Infinity records/second
060705 154524 Processing pagesByMD5: Sorted 55 instructions in 0.016 seconds.
060705 154524 Processing pagesByMD5: Sorted 3437.5 instructions/second
060705 154524 Processing pagesByMD5: Merged to new DB containing 56 records in 0.015 seconds
060705 154524 Processing pagesByMD5: Merged 3733.3333333333335 records/second
060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.016 seconds.
060705 154524 Processing linksByMD5: Sorted 7937.5 instructions/second
060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
060705 154524 Processing linksByMD5: Merged Infinity records/second
060705 154524 Processing linksByURL: Sorted 101 instructions in 0.0 seconds.
060705 154524 Processing linksByURL: Sorted Infinity instructions/second
060705 154524 Processing linksByURL: Merged to new DB containing 249 records in 0.016 seconds
060705 154524 Processing linksByURL: Merged 15562.5 records/second
060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.015 seconds.
060705 154524 Processing linksByMD5: Sorted 8466.666666666668 instructions/second
060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
060705 154524 Processing linksByMD5: Merged Infinity records/second
060705 154524 Update finished
060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
060705 154524 Sorting pages by url...
060705 154524 Getting updated scores and anchors from db...
060705 154524 Sorting updates by segment...
060705 154524 Updating segments...
060705 154524  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
060705 154525 Done updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
060705 154525 * Opening segment 20060705154333
060705 154525 * Indexing segment 20060705154333
060705 154525 found resource common-terms.utf8 at file:/C:/cygwin/nutch-0.7.2/conf/common-terms.utf8
060705 154525 * Optimizing index...
060705 154525 * Moving index to NFS if needed...
060705 154525 DONE indexing segment 20060705154333: total 1 records in 0.187 s (Infinity rec/s).
060705 154525 done indexing
060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
060705 154525 * Opening segment 20060705154337
060705 154525 * Indexing segment 20060705154337
060705 154525 * Optimizing index...
060705 154525 * Moving index to NFS if needed...
060705 154525 DONE indexing segment 20060705154337: total 26 records in 0.391 s (Infinity rec/s).
060705 154525 done indexing
060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
060705 154525 * Opening segment 20060705154442
060705 154525 * Indexing segment 20060705154442
060705 154525 * Optimizing index...
060705 154525 * Moving index to NFS if needed...
060705 154525 DONE indexing segment 20060705154442: total 26 records in 0.219 s (Infinity rec/s).
060705 154525 done indexing
060705 154526 Reading url hashes...
060705 154526 Sorting url hashes...
060705 154526 Deleting url duplicates...
060705 154526 Deleted 0 url duplicates.
060705 154526 Reading content hashes...
060705 154526 Sorting content hashes...
060705 154526 Deleting content duplicates...
060705 154526 Deleted 1 content duplicates.
060705 154526 Duplicate deletion complete locally.  Now returning to NFS...
060705 154526 DeleteDuplicates complete
060705 154526 Merging segment indexes...
060705 154526 crawl finished: crawled2
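
As a rough cross-check of the page count, the three "status: segment" lines in the log above can be tallied. This is a minimal sketch in Python, which is not part of the original thread; the regular expression and the function name pages_per_segment are illustrative assumptions, and the log lines are copied from the output above with wrapped lines rejoined.

```python
import re

# Status lines copied from the crawl log above (wrapped lines rejoined).
LOG = """\
060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172 bytes, 2000 ms
060705 154440 status: segment 20060705154337, 26 pages, 0 errors, 323050 bytes, 62047 ms
060705 154523 status: segment 20060705154442, 26 pages, 0 errors, 314308 bytes, 40063 ms
"""

# Assumed pattern for the per-segment summary line; adjust if the format differs.
STATUS = re.compile(r"status: segment (\d+), (\d+) pages, (\d+) errors")


def pages_per_segment(log_text):
    """Return one (segment, pages, errors) tuple per fetch round found in the log."""
    return [(seg, int(pages), int(errors))
            for seg, pages, errors in STATUS.findall(log_text)]


rounds = pages_per_segment(LOG)
total_pages = sum(pages for _, pages, _ in rounds)

for seg, pages, errors in rounds:
    print(f"segment {seg}: {pages} pages, {errors} errors")
print("total fetched:", total_pages)
```

With depth = 3 the crawl ran three fetch rounds, and the tally is 1 + 26 + 26 = 53 fetched pages, while the last pagesByURL merge in the log reports 56 records in the web DB.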

Re: why i can't crawl all the linked pages in the specified page to crawl.

Posted by Honda-Search Administrator <ad...@honda-search.com>.
I would also add that you need to make sure the crawl-urlfilter.txt file 
doesn't exclude any URLs on those pages.  I noticed a lot of '=' in the URLs 
on those pages.
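
To illustrate why '=' matters, here is a minimal Python sketch of a first-match-wins regex URL filter. The rules are hypothetical, written in the style of a crawl-urlfilter.txt file ('+' includes, '-' excludes), and are not the poster's actual configuration; the first-match behaviour and the function name accepts are assumptions made for the example, not Nutch's code.

```python
import re

# Hypothetical rules in the style of crawl-urlfilter.txt; NOT the poster's file.
RULES = [
    ("-", re.compile(r"[?*!@=]")),                               # skip query-like URLs
    ("+", re.compile(r"^http://([a-z0-9]*\.)*haha365\.com/")),   # accept the crawled site
    ("-", re.compile(r".")),                                     # skip everything else
]


def accepts(url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


# The first URL appears in the crawl log; the second is a made-up example with '='.
print(accepts("http://www.haha365.com/gd_joke/index_3.htm"))
print(accepts("http://www.haha365.com/list.asp?page=2"))
```

Under these assumed rules, any link containing '?' or '=' is dropped before it reaches the fetch list, even though it is on the accepted site.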

I agree with Stefan's suggestion to try crawling with a depth of 20.

Oh, another thing you might want to consider.  If your crawl-urlfilter.txt 
file is only configured for your domain, Nutch MIGHT have a problem.  Looking 
at your URLs, it appears that many of them are relative links and not 
absolute.  They link to "/directory/" instead of 
http://www.domain.com/directory/.  I'm unsure whether Nutch treats those as 
belonging to the same domain or ignores them because the URL does not 
conform to the crawl-urlfilter.txt rules.
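
For reference, this is how standard URL resolution turns relative links into absolute ones. It is a minimal Python sketch using urllib.parse.urljoin, not Nutch's own code; the base URL is taken from the crawl log, and the link targets are hypothetical examples. Whether Nutch resolves links before applying the URL filter is an assumption here, although the log does show only absolute URLs being fetched.

```python
from urllib.parse import urljoin

# Page URL taken from the crawl log; the link targets below are hypothetical.
base = "http://www.haha365.com/gd_joke/index_3.htm"

for href in ["/directory/", "20050319084431.htm", "../index.htm"]:
    # urljoin resolves the link against the page it was found on.
    print(href, "->", urljoin(base, href))
```

If links are resolved this way before filtering, a relative link such as "/directory/" would be checked against the filter rules as a full URL on the same host.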

Maybe someone will correct me on the last part, but I think it makes sense.

----- Original Message ----- 
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Thursday, July 06, 2006 10:59 PM
Subject: Re: why i can't crawl all the linked pages in the specified page to 
crawl.


> Hi,
> maybe you can try a much higher depth, something like 20?
> However, in general check:
> + the regex URL filter file
> + the robots.txt
> + nofollow tags in the pages
> + the number of outlinks to extract in nutch-default.xml
>
> Stefan
> On 06.07.2006, at 19:12, kevin pang wrote:
>
>> i set up the nutch to crawl the url: http://www.haha365.com/gd_joke/
>> but after crawl complete, only 54 pages were fetched.
>>
>> here is the log info:
>>
>> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch- default.xml
>> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/crawl-tool.xml
>> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-site.xml
>> 060705 154332 No FS indicated, using default:local
>> 060705 154332 crawl started in: crawled2
>> 060705 154332 rootUrlFile = url.txt
>> 060705 154332 threads = 4
>> 060705 154332 depth = 3
>> 060705 154333 Created webdb at LocalFS,C:\cygwin\nutch-0.7.2\bin 
>> \crawled2\db
>> 060705 154333 Starting URL processing
>> 060705 154333 Plugins: looking in: C:\cygwin\nutch-0.7.2\plugins
>> 060705 154333 parsing: C:\cygwin\nutch-
>> 0.7.2\plugins\urlfilter-regex\plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.net.URLFilter class=
>> org.apache.nutch.net.RegexURLFilter
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins 
>> \urlfilter-prefix
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-url 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
>> org.apache.nutch.searcher.url.URLQueryFilter
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-site 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
>> org.apache.nutch.searcher.site.SiteQueryFilter
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\query-more
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-basic 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
>> org.apache.nutch.searcher.basic.BasicQueryFilter
>> 060705 154333 not including: C:\cygwin\nutch-
>> 0.7.2\plugins\protocol-httpclient
>> 060705 154333 parsing: C:\cygwin\nutch-
>> 0.7.2\plugins\protocol-http\plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.protocol.Protocol class=
>> org.apache.nutch.protocol.http.Http
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol- ftp
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol- file
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-text 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.parse.Parser class=
>> org.apache.nutch.parse.text.TextParser
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-rss
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-pdf
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse- msword
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-js
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-html 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.parse.Parser class=
>> org.apache.nutch.parse.html.HtmlParser
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-ext
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\ontology
>> 060705 154333 parsing: C:\cygwin\nutch-
>> 0.7.2\plugins\nutch-extensionpoints\plugin.xml
>> 060705 154333 not including: C:\cygwin\nutch-
>> 0.7.2\plugins\language-identifier
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\index-more
>> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\index-basic 
>> \plugin.xml
>> 060705 154333 impl: point=org.apache.nutch.indexer.IndexingFilter  class=
>> org.apache.nutch.indexer.basic.BasicIndexingFilter
>> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins 
>> \creativecommons
>> 060705 154333 not including: C:\cygwin\nutch-
>> 0.7.2\plugins\clustering-carrot2
>> 060705 154333 found resource crawl-urlfilter.txt at file:/C:/cygwin/ 
>> nutch-
>> 0.7.2/conf/crawl-urlfilter.txt
>> 060705 154333 Using URL normalizer: 
>> org.apache.nutch.net.BasicUrlNormalizer
>> 060705 154333 Added 1 pages
>> 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.016 
>> seconds.
>> 060705 154333 Processing pagesByURL: Sorted 62.5 instructions/second
>> 060705 154333 Processing pagesByURL: Merged to new DB containing 1 
>> records
>> in 0.0 seconds
>> 060705 154333 Processing pagesByURL: Merged Infinity records/second
>> 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 
>> seconds.
>> 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/ second
>> 060705 154333 Processing pagesByMD5: Merged to new DB containing 1 
>> records
>> in 0.0 seconds
>> 060705 154333 Processing pagesByMD5: Merged Infinity records/second
>> 060705 154333 Processing linksByMD5: Copied file (0 bytes) in 0.016 
>> secs.
>> 060705 154333 Processing linksByURL: Copied file (0 bytes) in 0.015 
>> secs.
>> 060705 154333 FetchListTool started
>> 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.0 
>> seconds.
>> 060705 154333 Processing pagesByURL: Sorted Infinity instructions/ second
>> 060705 154333 Processing pagesByURL: Merged to new DB containing 1 
>> records
>> in 0.0 seconds
>> 060705 154333 Processing pagesByURL: Merged Infinity records/second
>> 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 
>> seconds.
>> 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/ second
>> 060705 154334 Processing pagesByMD5: Merged to new DB containing 1 
>> records
>> in 0.0 seconds
>> 060705 154334 Processing pagesByMD5: Merged Infinity records/second
>> 060705 154334 Processing linksByMD5: Copied file (0 bytes) in 0.031 
>> secs.
>> 060705 154334 Processing linksByURL: Copied file (0 bytes) in 0.015 
>> secs.
>> 060705 154334 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted:  Sorted 1
>> entries in 0.015 seconds.
>> 060705 154334 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted
>> 66.66666666666667 entries/second
>> 060705 154334 Overall processing: Sorted 1 entries in 0.015 seconds.
>> 060705 154334 Overall processing: Sorted 0.015 entries/second
>> 060705 154334 FetchListTool completed
>> 060705 154334 logging at INFO
>> 060705 154334 fetching http://www.haha365.com/gd_joke/index_3.htm
>> 060705 154334 http.proxy.host = null
>> 060705 154334 http.proxy.port = 8080
>> 060705 154334 http.timeout = 10000
>> 060705 154334 http.content.limit = 65536
>> 060705 154334 http.agent = NutchCVS/0.7.2 (Nutch;
>> http://lucene.apache.org/nutch/bot.html; nutch- agent@lucene.apache.org)
>> 060705 154334 fetcher.server.delay = 1000
>> 060705 154334 http.max.delays = 100
>> 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172
>> bytes, 2000 ms
>> 060705 154336 status: 0.5 pages/s, 74.890625 kb/s, 19172.0 bytes/page
>> 060705 154337 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
>> 060705 154337 Updating for C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333
>> 060705 154337 Processing document 0
>> 060705 154337 Finishing update
>> 060705 154337 Processing pagesByURL: Sorted 27 instructions in 
>> 0.015seconds.
>> 060705 154337 Processing pagesByURL: Sorted 1800.0 instructions/second
>> 060705 154337 Processing pagesByURL: Merged to new DB containing 27 
>> records
>> in 0.0 seconds
>> 060705 154337 Processing pagesByURL: Merged Infinity records/second
>> 060705 154337 Processing pagesByMD5: Sorted 28 instructions in 
>> 0.015seconds.
>> 060705 154337 Processing pagesByMD5: Sorted
>> 1866.6666666666667instructions/second
>> 060705 154337 Processing pagesByMD5: Merged to new DB containing 27 
>> records
>> in 0.016 seconds
>> 060705 154337 Processing pagesByMD5: Merged 1687.5 records/second
>> 060705 154337 Processing linksByMD5: Sorted 27 instructions in 
>> 0.015seconds.
>> 060705 154337 Processing linksByMD5: Sorted 1800.0 instructions/second
>> 060705 154337 Processing linksByMD5: Merged to new DB containing 26 
>> records
>> in 0.0 seconds
>> 060705 154337 Processing linksByMD5: Merged Infinity records/second
>> 060705 154337 Processing linksByURL: Sorted 26 instructions in 
>> 0.015seconds.
>> 060705 154337 Processing linksByURL: Sorted
>> 1733.3333333333335instructions/second
>> 060705 154337 Processing linksByURL: Merged to new DB containing 26 
>> records
>> in 0.0 seconds
>> 060705 154337 Processing linksByURL: Merged Infinity records/second
>> 060705 154337 Processing linksByMD5: Sorted 26 instructions in 
>> 0.031seconds.
>> 060705 154337 Processing linksByMD5: Sorted 
>> 838.7096774193549instructions/second
>> 060705 154337 Processing linksByMD5: Merged to new DB containing 26 
>> records
>> in 0.0 seconds
>> 060705 154337 Processing linksByMD5: Merged Infinity records/second
>> 060705 154337 Update finished
>> 060705 154337 FetchListTool started
>> 060705 154338 Processing pagesByURL: Sorted 26 instructions in 
>> 0.016seconds.
>> 060705 154338 Processing pagesByURL: Sorted 1625.0 instructions/second
>> 060705 154338 Processing pagesByURL: Merged to new DB containing 27 
>> records
>> in 0.0 seconds
>> 060705 154338 Processing pagesByURL: Merged Infinity records/second
>> 060705 154338 Processing pagesByMD5: Sorted 26 instructions in 0.0 
>> seconds.
>> 060705 154338 Processing pagesByMD5: Sorted Infinity instructions/ second
>> 060705 154338 Processing pagesByMD5: Merged to new DB containing 27 
>> records
>> in 0.015 seconds
>> 060705 154338 Processing pagesByMD5: Merged 1800.0 records/second
>> 060705 154338 Processing linksByMD5: Copied file (0 bytes) in 0.016 
>> secs.
>> 060705 154338 Processing linksByURL: Copied file (0 bytes) in 0.0  secs.
>> 060705 154338 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted:  Sorted 26
>> entries in 0.0 seconds.
>> 060705 154338 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted
>> Infinity entries/second
>> 060705 154338 Overall processing: Sorted 26 entries in 0.0 seconds.
>> 060705 154338 Overall processing: Sorted 0.0 entries/second
>> 060705 154338 FetchListTool completed
>> 060705 154338 logging at INFO
>> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084431.htm
>> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084733.htm
>> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319085110.htm
>> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084338.htm
>> 060705 154339 fetching http://www.haha365.com/gd_joke/20050319085226.htm
>> 060705 154340 fetching http://www.haha365.com/gd_joke/20050318163740.htm
>> 060705 154341 fetching http://www.haha365.com/gd_joke/20050319085344.htm
>> 060705 154343 fetching http://www.haha365.com/gd_joke/20050318163709.htm
>> 060705 154345 fetching http://www.haha365.com/gd_joke/20050319085310.htm
>> 060705 154347 fetching http://www.haha365.com/gd_joke/20050319085028.htm
>> 060705 154349 fetching http://www.haha365.com/gd_joke/20050319084052.htm
>> 060705 154350 fetching http://www.haha365.com/gd_joke/index.htm
>> 060705 154352 fetching http://www.haha365.com/gd_joke/20050319084902.htm
>> 060705 154353 fetching http://www.haha365.com/gd_joke/20050319084945.htm
>> 060705 154355 fetching http://www.haha365.com/gd_joke/20050319084129.htm
>> 060705 154356 fetching http://www.haha365.com/gd_joke/20050319084202.htm
>> 060705 154358 fetching http://www.haha365.com/gd_joke/20050318163642.htm
>> 060705 154359 fetching http://www.haha365.com/gd_joke/20050319084304.htm
>> 060705 154400 fetching http://www.haha365.com/gd_joke/20050319084822.htm
>> 060705 154402 fetching http://www.haha365.com/gd_joke/20050319085142.htm
>> 060705 154403 fetching http://www.haha365.com/gd_joke/20050319084232.htm
>> 060705 154408 fetching http://www.haha365.com/gd_joke/20050318163829.htm
>> 060705 154411 fetching http://www.haha365.com/gd_joke/20050318163920.htm
>> 060705 154415 fetching http://www.haha365.com/gd_joke/20050319084559.htm
>> 060705 154419 fetching http://www.haha365.com/gd_joke/
>> 060705 154423 fetching http://www.haha365.com/gd_joke/20050318163807.htm
>> 060705 154440 status: segment 20060705154337, 26 pages, 0 errors,  323050
>> bytes, 62047 ms
>> 060705 154440 status: 0.41903716 pages/s, 40.67607 kb/s, 12425.0 
>> bytes/page
>> 060705 154441 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
>> 060705 154441 Updating for C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337
>> 060705 154441 Processing document 0
>> 060705 154441 Finishing update
>> 060705 154441 Processing pagesByURL: Sorted 174 instructions in 
>> 0.016seconds.
>> 060705 154441 Processing pagesByURL: Sorted 10875.0 instructions/ second
>> 060705 154441 Processing pagesByURL: Merged to new DB containing 53 
>> records
>> in 0.0 seconds
>> 060705 154441 Processing pagesByURL: Merged Infinity records/second
>> 060705 154441 Processing pagesByMD5: Sorted 78 instructions in 
>> 0.015seconds.
>> 060705 154441 Processing pagesByMD5: Sorted 5200.0 instructions/second
>> 060705 154441 Processing pagesByMD5: Merged to new DB containing 53 
>> records
>> in 0.0 seconds
>> 060705 154441 Processing pagesByMD5: Merged Infinity records/second
>> 060705 154441 Processing linksByMD5: Sorted 174 instructions in 
>> 0.016seconds.
>> 060705 154441 Processing linksByMD5: Sorted 10875.0 instructions/ second
>> 060705 154441 Processing linksByMD5: Merged to new DB containing  148 
>> records
>> in 0.015 seconds
>> 060705 154441 Processing linksByMD5: Merged 9866.666666666668 
>> records/second
>> 060705 154441 Processing linksByURL: Sorted 122 instructions in 0.0 
>> seconds.
>> 060705 154441 Processing linksByURL: Sorted Infinity instructions/ second
>> 060705 154441 Processing linksByURL: Merged to new DB containing  148 
>> records
>> in 0.015 seconds
>> 060705 154441 Processing linksByURL: Merged 9866.666666666668 
>> records/second
>> 060705 154441 Processing linksByMD5: Sorted 148 instructions in 0.0 
>> seconds.
>> 060705 154441 Processing linksByMD5: Sorted Infinity instructions/ second
>> 060705 154441 Processing linksByMD5: Merged to new DB containing  148 
>> records
>> in 0.016 seconds
>> 060705 154441 Processing linksByMD5: Merged 9250.0 records/second
>> 060705 154442 Update finished
>> 060705 154442 FetchListTool started
>> 060705 154442 Processing pagesByURL: Sorted 26 instructions in 
>> 0.016seconds.
>> 060705 154442 Processing pagesByURL: Sorted 1625.0 instructions/second
>> 060705 154442 Processing pagesByURL: Merged to new DB containing 53 
>> records
>> in 0.015 seconds
>> 060705 154442 Processing pagesByURL: Merged 
>> 3533.3333333333335records/second
>> 060705 154442 Processing pagesByMD5: Sorted 26 instructions in 0.0 
>> seconds.
>> 060705 154442 Processing pagesByMD5: Sorted Infinity instructions/ second
>> 060705 154442 Processing pagesByMD5: Merged to new DB containing 53 
>> records
>> in 0.0 seconds
>> 060705 154442 Processing pagesByMD5: Merged Infinity records/second
>> 060705 154442 Processing linksByMD5: Copied file (0 bytes) in 0.016 
>> secs.
>> 060705 154442 Processing linksByURL: Copied file (0 bytes) in 0.0  secs.
>> 060705 154442 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted:  Sorted 26
>> entries in 0.093 seconds.
>> 060705 154442 Processing C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted
>> 279.5698924731183 entries/second
>> 060705 154442 Overall processing: Sorted 26 entries in 0.093 seconds.
>> 060705 154442 Overall processing: Sorted 0.003576923076923077 
>> entries/second
>> 060705 154443 FetchListTool completed
>> 060705 154443 logging at INFO
>> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815111532.htm
>> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815105800.htm
>> 060705 154443 fetching http://www.haha365.com/gd_joke/20050319085605.htm
>> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815110121.htm
>> 060705 154446 fetching http://www.haha365.com/gd_joke/20060625064748.htm
>> 060705 154448 fetching http://www.haha365.com/gd_joke/20050815105937.htm
>> 060705 154449 fetching http://www.haha365.com/gd_joke/20050815110925.htm
>> 060705 154450 fetching http://www.haha365.com/gd_joke/20050815111651.htm
>> 060705 154452 fetching http://www.haha365.com/gd_joke/20050706110014.htm
>> 060705 154453 fetching http://www.haha365.com/gd_joke/20050318163615.htm
>> 060705 154454 fetching http://www.haha365.com/gd_joke/20050815111228.htm
>> 060705 154456 fetching http://www.haha365.com/gd_joke/20050706105833.htm
>> 060705 154457 fetching http://www.haha365.com/gd_joke/20050815110411.htm
>> 060705 154459 fetching http://www.haha365.com/gd_joke/20050815105527.htm
>> 060705 154500 fetching http://www.haha365.com/gd_joke/20050815111758.htm
>> 060705 154502 fetching http://www.haha365.com/gd_joke/20050706110230.htm
>> 060705 154503 fetching http://www.haha365.com/gd_joke/20050706105453.htm
>> 060705 154504 fetching http://www.haha365.com/gd_joke/20050706110522.htm
>> 060705 154506 fetching http://www.haha365.com/gd_joke/20050706105104.htm
>> 060705 154507 fetching http://www.haha365.com/gd_joke/20050709144044.htm
>> 060705 154509 fetching http://www.haha365.com/gd_joke/20060611112617.htm
>> 060705 154510 fetching http://www.haha365.com/gd_joke/20050815105330.htm
>> 060705 154511 fetching http://www.haha365.com/gd_joke/20050709144708.htm
>> 060705 154513 fetching http://www.haha365.com/gd_joke/20050706105324.htm
>> 060705 154514 fetching http://www.haha365.com/gd_joke/20050815110707.htm
>> 060705 154516 fetching http://www.haha365.com/gd_joke/20050706105218.htm
>> 060705 154523 status: segment 20060705154442, 26 pages, 0 errors,  314308
>> bytes, 40063 ms
>> 060705 154523 status: 0.6489779 pages/s, 61.291748 kb/s, 12088.77 
>> bytes/page
>> 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
>> 060705 154524 Updating for C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442
>> 060705 154524 Processing document 0
>> 060705 154524 Finishing update
>> 060705 154524 Processing pagesByURL: Sorted 127 instructions in 0.0 
>> seconds.
>> 060705 154524 Processing pagesByURL: Sorted Infinity instructions/ second
>> 060705 154524 Processing pagesByURL: Merged to new DB containing 56 
>> records
>> in 0.0 seconds
>> 060705 154524 Processing pagesByURL: Merged Infinity records/second
>> 060705 154524 Processing pagesByMD5: Sorted 55 instructions in 
>> 0.016seconds.
>> 060705 154524 Processing pagesByMD5: Sorted 3437.5 instructions/second
>> 060705 154524 Processing pagesByMD5: Merged to new DB containing 56 
>> records
>> in 0.015 seconds
>> 060705 154524 Processing pagesByMD5: Merged 
>> 3733.3333333333335records/second
>> 060705 154524 Processing linksByMD5: Sorted 127 instructions in 
>> 0.016seconds.
>> 060705 154524 Processing linksByMD5: Sorted 7937.5 instructions/second
>> 060705 154524 Processing linksByMD5: Merged to new DB containing  249 
>> records
>> in 0.0 seconds
>> 060705 154524 Processing linksByMD5: Merged Infinity records/second
>> 060705 154524 Processing linksByURL: Sorted 101 instructions in 0.0 
>> seconds.
>> 060705 154524 Processing linksByURL: Sorted Infinity instructions/ second
>> 060705 154524 Processing linksByURL: Merged to new DB containing  249 
>> records
>> in 0.016 seconds
>> 060705 154524 Processing linksByURL: Merged 15562.5 records/second
>> 060705 154524 Processing linksByMD5: Sorted 127 instructions in 
>> 0.015seconds.
>> 060705 154524 Processing linksByMD5: Sorted 
>> 8466.666666666668instructions/second
>> 060705 154524 Processing linksByMD5: Merged to new DB containing  249 
>> records
>> in 0.0 seconds
>> 060705 154524 Processing linksByMD5: Merged Infinity records/second
>> 060705 154524 Update finished
>> 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments  from
>> C:\cygwin\nutch-0.7.2\bin\crawled2\db
>> 060705 154524  reading C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333
>> 060705 154524  reading C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337
>> 060705 154524  reading C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442
>> 060705 154524 Sorting pages by url...
>> 060705 154524 Getting updated scores and anchors from db...
>> 060705 154524 Sorting updates by segment...
>> 060705 154524 Updating segments...
>> 060705 154524  updating C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333
>> 060705 154525  updating C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337
>> 060705 154525  updating C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442
>> 060705 154525 Done updating C:\cygwin\nutch-0.7.2\bin\crawled2 \segments 
>> from
>> C:\cygwin\nutch-0.7.2\bin\crawled2\db
>> 060705 154525 indexing segment: C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154333
>> 060705 154525 * Opening segment 20060705154333
>> 060705 154525 * Indexing segment 20060705154333
>> 060705 154525 found resource common-terms.utf8 at file:/C:/cygwin/ nutch-
>> 0.7.2/conf/common-terms.utf8
>> 060705 154525 * Optimizing index...
>> 060705 154525 * Moving index to NFS if needed...
>> 060705 154525 DONE indexing segment 20060705154333: total 1 records in
>> 0.187s (Infinity rec/s).
>> 060705 154525 done indexing
>> 060705 154525 indexing segment: C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154337
>> 060705 154525 * Opening segment 20060705154337
>> 060705 154525 * Indexing segment 20060705154337
>> 060705 154525 * Optimizing index...
>> 060705 154525 * Moving index to NFS if needed...
>> 060705 154525 DONE indexing segment 20060705154337: total 26  records in
>> 0.391 s (Infinity rec/s).
>> 060705 154525 done indexing
>> 060705 154525 indexing segment: C:\cygwin\nutch-
>> 0.7.2\bin\crawled2\segments\20060705154442
>> 060705 154525 * Opening segment 20060705154442
>> 060705 154525 * Indexing segment 20060705154442
>> 060705 154525 * Optimizing index...
>> 060705 154525 * Moving index to NFS if needed...
>> 060705 154525 DONE indexing segment 20060705154442: total 26  records in
>> 0.219 s (Infinity rec/s).
>> 060705 154525 done indexing
>> 060705 154526 Reading url hashes...
>> 060705 154526 Sorting url hashes...
>> 060705 154526 Deleting url duplicates...
>> 060705 154526 Deleted 0 url duplicates.
>> 060705 154526 Reading content hashes...
>> 060705 154526 Sorting content hashes...
>> 060705 154526 Deleting content duplicates...
>> 060705 154526 Deleted 1 content duplicates.
>> 060705 154526 Duplicate deletion complete locally.  Now returning  to 
>> NFS...
>> 060705 154526 DeleteDuplicates complete
>> 060705 154526 Merging segment indexes...
>> 060705 154526 crawl finished: crawled2
>
>
> 


Re: why i can't crawl all the linked pages in the specified page to crawl.

Posted by "Tonal Communications (Stijn Amundsen)" <St...@tonalweb.com>.
That brings up a question: does Nutch count the URL in a form action and an
image's source URL toward the default limit of 100 links per page, or does it
count only <a href> tags? And what about Google, Yahoo, etc.: do they count
only <a href> links?


http://tonalweb.com
----- Original Message -----
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Friday, July 07, 2006 1:59 AM
Subject: Re: why i can't crawl all the linked pages in the specified page to
crawl.


> Hi,
> maybe you can try a much higher depth, something like 20?
> However, in general check:
> + the regex URL filter file
> + the robots.txt
> + nofollow tags in the pages
> + the number of outlinks to extract in nutch-default.xml
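
To make the first item on this checklist concrete, here is a minimal Python
sketch of how a rule file in the style of conf/crawl-urlfilter.txt is
evaluated: each rule starts with '+' (accept) or '-' (reject), rules are tried
in order, and the first pattern found in the URL decides. The rules below are
illustrative assumptions, not the poster's actual file, and the function names
are hypothetical; Nutch itself does this in Java inside RegexURLFilter.

```python
import re

# Hypothetical rules, written in the style of conf/crawl-urlfilter.txt.
# They are an assumption for illustration, not the poster's real file.
RULES = r"""
# skip URLs containing characters that usually mark queries or sessions
-[?*!@=]
# accept anything on the crawled host
+^http://([a-z0-9]*\.)*haha365.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return the verdict of the first rule whose pattern occurs in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: treat the URL as rejected


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in (
        "http://www.haha365.com/gd_joke/index_3.htm",
        "http://www.haha365.com/gd_joke/list.asp?page=2",
        "http://example.org/",
    ):
        print(url, "->", "accepted" if accepts(url, rules) else "rejected")
```

With rules like these, any link whose URL contains a query string is dropped
before it reaches the fetch list, so it is worth checking whether the site's
"next page" links look like that.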
>
> Stefan
> On 06.07.2006, at 19:12, kevin pang wrote:
>
> > I set up Nutch to crawl the URL http://www.haha365.com/gd_joke/
> > but after the crawl completed, only 54 pages were fetched.
> >
> > here is the log info:
> >
> > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-
> > default.xml
> > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/crawl-tool.xml
> > 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-site.xml
> > 060705 154332 No FS indicated, using default:local
> > 060705 154332 crawl started in: crawled2
> > 060705 154332 rootUrlFile = url.txt
> > 060705 154332 threads = 4
> > 060705 154332 depth = 3
> > 060705 154333 Created webdb at LocalFS,C:\cygwin\nutch-0.7.2\bin
> > \crawled2\db
> > 060705 154333 Starting URL processing
> > 060705 154333 Plugins: looking in: C:\cygwin\nutch-0.7.2\plugins
> > 060705 154333 parsing: C:\cygwin\nutch-
> > 0.7.2\plugins\urlfilter-regex\plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.net.URLFilter class=
> > org.apache.nutch.net.RegexURLFilter
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins
> > \urlfilter-prefix
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-url
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
> > org.apache.nutch.searcher.url.URLQueryFilter
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-site
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
> > org.apache.nutch.searcher.site.SiteQueryFilter
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\query-more
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-basic
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=
> > org.apache.nutch.searcher.basic.BasicQueryFilter
> > 060705 154333 not including: C:\cygwin\nutch-
> > 0.7.2\plugins\protocol-httpclient
> > 060705 154333 parsing: C:\cygwin\nutch-
> > 0.7.2\plugins\protocol-http\plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.protocol.Protocol class=
> > org.apache.nutch.protocol.http.Http
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-
> > ftp
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-
> > file
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-text
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.parse.Parser class=
> > org.apache.nutch.parse.text.TextParser
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-rss
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-pdf
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-
> > msword
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-js
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-html
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.parse.Parser class=
> > org.apache.nutch.parse.html.HtmlParser
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-ext
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\ontology
> > 060705 154333 parsing: C:\cygwin\nutch-
> > 0.7.2\plugins\nutch-extensionpoints\plugin.xml
> > 060705 154333 not including: C:\cygwin\nutch-
> > 0.7.2\plugins\language-identifier
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\index-more
> > 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\index-basic
> > \plugin.xml
> > 060705 154333 impl: point=org.apache.nutch.indexer.IndexingFilter
> > class=
> > org.apache.nutch.indexer.basic.BasicIndexingFilter
> > 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins
> > \creativecommons
> > 060705 154333 not including: C:\cygwin\nutch-
> > 0.7.2\plugins\clustering-carrot2
> > 060705 154333 found resource crawl-urlfilter.txt at file:/C:/cygwin/
> > nutch-
> > 0.7.2/conf/crawl-urlfilter.txt
> > 060705 154333 Using URL normalizer:
> > org.apache.nutch.net.BasicUrlNormalizer
> > 060705 154333 Added 1 pages
> > 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.016
> > seconds.
> > 060705 154333 Processing pagesByURL: Sorted 62.5 instructions/second
> > 060705 154333 Processing pagesByURL: Merged to new DB containing 1
> > records
> > in 0.0 seconds
> > 060705 154333 Processing pagesByURL: Merged Infinity records/second
> > 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0
> > seconds.
> > 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/
> > second
> > 060705 154333 Processing pagesByMD5: Merged to new DB containing 1
> > records
> > in 0.0 seconds
> > 060705 154333 Processing pagesByMD5: Merged Infinity records/second
> > 060705 154333 Processing linksByMD5: Copied file (0 bytes) in 0.016
> > secs.
> > 060705 154333 Processing linksByURL: Copied file (0 bytes) in 0.015
> > secs.
> > 060705 154333 FetchListTool started
> > 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.0
> > seconds.
> > 060705 154333 Processing pagesByURL: Sorted Infinity instructions/
> > second
> > 060705 154333 Processing pagesByURL: Merged to new DB containing 1
> > records
> > in 0.0 seconds
> > 060705 154333 Processing pagesByURL: Merged Infinity records/second
> > 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0
> > seconds.
> > 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/
> > second
> > 060705 154334 Processing pagesByMD5: Merged to new DB containing 1
> > records
> > in 0.0 seconds
> > 060705 154334 Processing pagesByMD5: Merged Infinity records/second
> > 060705 154334 Processing linksByMD5: Copied file (0 bytes) in 0.031
> > secs.
> > 060705 154334 Processing linksByURL: Copied file (0 bytes) in 0.015
> > secs.
> > 060705 154334 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted:
> > Sorted 1
> > entries in 0.015 seconds.
> > 060705 154334 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted
> > 66.66666666666667 entries/second
> > 060705 154334 Overall processing: Sorted 1 entries in 0.015 seconds.
> > 060705 154334 Overall processing: Sorted 0.015 entries/second
> > 060705 154334 FetchListTool completed
> > 060705 154334 logging at INFO
> > 060705 154334 fetching http://www.haha365.com/gd_joke/index_3.htm
> > 060705 154334 http.proxy.host = null
> > 060705 154334 http.proxy.port = 8080
> > 060705 154334 http.timeout = 10000
> > 060705 154334 http.content.limit = 65536
> > 060705 154334 http.agent = NutchCVS/0.7.2 (Nutch;
> > http://lucene.apache.org/nutch/bot.html; nutch-
> > agent@lucene.apache.org)
> > 060705 154334 fetcher.server.delay = 1000
> > 060705 154334 http.max.delays = 100
> > 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172
> > bytes, 2000 ms
> > 060705 154336 status: 0.5 pages/s, 74.890625 kb/s, 19172.0 bytes/page
> > 060705 154337 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> > 060705 154337 Updating for C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154333
> > 060705 154337 Processing document 0
> > 060705 154337 Finishing update
> > 060705 154337 Processing pagesByURL: Sorted 27 instructions in
> > 0.015seconds.
> > 060705 154337 Processing pagesByURL: Sorted 1800.0 instructions/second
> > 060705 154337 Processing pagesByURL: Merged to new DB containing 27
> > records
> > in 0.0 seconds
> > 060705 154337 Processing pagesByURL: Merged Infinity records/second
> > 060705 154337 Processing pagesByMD5: Sorted 28 instructions in
> > 0.015seconds.
> > 060705 154337 Processing pagesByMD5: Sorted
> > 1866.6666666666667instructions/second
> > 060705 154337 Processing pagesByMD5: Merged to new DB containing 27
> > records
> > in 0.016 seconds
> > 060705 154337 Processing pagesByMD5: Merged 1687.5 records/second
> > 060705 154337 Processing linksByMD5: Sorted 27 instructions in
> > 0.015seconds.
> > 060705 154337 Processing linksByMD5: Sorted 1800.0 instructions/second
> > 060705 154337 Processing linksByMD5: Merged to new DB containing 26
> > records
> > in 0.0 seconds
> > 060705 154337 Processing linksByMD5: Merged Infinity records/second
> > 060705 154337 Processing linksByURL: Sorted 26 instructions in
> > 0.015seconds.
> > 060705 154337 Processing linksByURL: Sorted
> > 1733.3333333333335instructions/second
> > 060705 154337 Processing linksByURL: Merged to new DB containing 26
> > records
> > in 0.0 seconds
> > 060705 154337 Processing linksByURL: Merged Infinity records/second
> > 060705 154337 Processing linksByMD5: Sorted 26 instructions in
> > 0.031seconds.
> > 060705 154337 Processing linksByMD5: Sorted
> > 838.7096774193549instructions/second
> > 060705 154337 Processing linksByMD5: Merged to new DB containing 26
> > records
> > in 0.0 seconds
> > 060705 154337 Processing linksByMD5: Merged Infinity records/second
> > 060705 154337 Update finished
> > 060705 154337 FetchListTool started
> > 060705 154338 Processing pagesByURL: Sorted 26 instructions in
> > 0.016seconds.
> > 060705 154338 Processing pagesByURL: Sorted 1625.0 instructions/second
> > 060705 154338 Processing pagesByURL: Merged to new DB containing 27
> > records
> > in 0.0 seconds
> > 060705 154338 Processing pagesByURL: Merged Infinity records/second
> > 060705 154338 Processing pagesByMD5: Sorted 26 instructions in 0.0
> > seconds.
> > 060705 154338 Processing pagesByMD5: Sorted Infinity instructions/
> > second
> > 060705 154338 Processing pagesByMD5: Merged to new DB containing 27
> > records
> > in 0.015 seconds
> > 060705 154338 Processing pagesByMD5: Merged 1800.0 records/second
> > 060705 154338 Processing linksByMD5: Copied file (0 bytes) in 0.016
> > secs.
> > 060705 154338 Processing linksByURL: Copied file (0 bytes) in 0.0
> > secs.
> > 060705 154338 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted:
> > Sorted 26
> > entries in 0.0 seconds.
> > 060705 154338 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted
> > Infinity entries/second
> > 060705 154338 Overall processing: Sorted 26 entries in 0.0 seconds.
> > 060705 154338 Overall processing: Sorted 0.0 entries/second
> > 060705 154338 FetchListTool completed
> > 060705 154338 logging at INFO
> > 060705 154338 fetching http://www.haha365.com/gd_joke/
> > 20050319084431.htm
> > 060705 154338 fetching http://www.haha365.com/gd_joke/
> > 20050319084733.htm
> > 060705 154338 fetching http://www.haha365.com/gd_joke/
> > 20050319085110.htm
> > 060705 154338 fetching http://www.haha365.com/gd_joke/
> > 20050319084338.htm
> > 060705 154339 fetching http://www.haha365.com/gd_joke/
> > 20050319085226.htm
> > 060705 154340 fetching http://www.haha365.com/gd_joke/
> > 20050318163740.htm
> > 060705 154341 fetching http://www.haha365.com/gd_joke/
> > 20050319085344.htm
> > 060705 154343 fetching http://www.haha365.com/gd_joke/
> > 20050318163709.htm
> > 060705 154345 fetching http://www.haha365.com/gd_joke/
> > 20050319085310.htm
> > 060705 154347 fetching http://www.haha365.com/gd_joke/
> > 20050319085028.htm
> > 060705 154349 fetching http://www.haha365.com/gd_joke/
> > 20050319084052.htm
> > 060705 154350 fetching http://www.haha365.com/gd_joke/index.htm
> > 060705 154352 fetching http://www.haha365.com/gd_joke/
> > 20050319084902.htm
> > 060705 154353 fetching http://www.haha365.com/gd_joke/
> > 20050319084945.htm
> > 060705 154355 fetching http://www.haha365.com/gd_joke/
> > 20050319084129.htm
> > 060705 154356 fetching http://www.haha365.com/gd_joke/
> > 20050319084202.htm
> > 060705 154358 fetching http://www.haha365.com/gd_joke/
> > 20050318163642.htm
> > 060705 154359 fetching http://www.haha365.com/gd_joke/
> > 20050319084304.htm
> > 060705 154400 fetching http://www.haha365.com/gd_joke/
> > 20050319084822.htm
> > 060705 154402 fetching http://www.haha365.com/gd_joke/
> > 20050319085142.htm
> > 060705 154403 fetching http://www.haha365.com/gd_joke/
> > 20050319084232.htm
> > 060705 154408 fetching http://www.haha365.com/gd_joke/
> > 20050318163829.htm
> > 060705 154411 fetching http://www.haha365.com/gd_joke/
> > 20050318163920.htm
> > 060705 154415 fetching http://www.haha365.com/gd_joke/
> > 20050319084559.htm
> > 060705 154419 fetching http://www.haha365.com/gd_joke/
> > 060705 154423 fetching http://www.haha365.com/gd_joke/
> > 20050318163807.htm
> > 060705 154440 status: segment 20060705154337, 26 pages, 0 errors,
> > 323050
> > bytes, 62047 ms
> > 060705 154440 status: 0.41903716 pages/s, 40.67607 kb/s, 12425.0
> > bytes/page
> > 060705 154441 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> > 060705 154441 Updating for C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154337
> > 060705 154441 Processing document 0
> > 060705 154441 Finishing update
> > 060705 154441 Processing pagesByURL: Sorted 174 instructions in
> > 0.016seconds.
> > 060705 154441 Processing pagesByURL: Sorted 10875.0 instructions/
> > second
> > 060705 154441 Processing pagesByURL: Merged to new DB containing 53
> > records
> > in 0.0 seconds
> > 060705 154441 Processing pagesByURL: Merged Infinity records/second
> > 060705 154441 Processing pagesByMD5: Sorted 78 instructions in
> > 0.015seconds.
> > 060705 154441 Processing pagesByMD5: Sorted 5200.0 instructions/second
> > 060705 154441 Processing pagesByMD5: Merged to new DB containing 53
> > records
> > in 0.0 seconds
> > 060705 154441 Processing pagesByMD5: Merged Infinity records/second
> > 060705 154441 Processing linksByMD5: Sorted 174 instructions in
> > 0.016seconds.
> > 060705 154441 Processing linksByMD5: Sorted 10875.0 instructions/
> > second
> > 060705 154441 Processing linksByMD5: Merged to new DB containing
> > 148 records
> > in 0.015 seconds
> > 060705 154441 Processing linksByMD5: Merged 9866.666666666668
> > records/second
> > 060705 154441 Processing linksByURL: Sorted 122 instructions in 0.0
> > seconds.
> > 060705 154441 Processing linksByURL: Sorted Infinity instructions/
> > second
> > 060705 154441 Processing linksByURL: Merged to new DB containing
> > 148 records
> > in 0.015 seconds
> > 060705 154441 Processing linksByURL: Merged 9866.666666666668
> > records/second
> > 060705 154441 Processing linksByMD5: Sorted 148 instructions in 0.0
> > seconds.
> > 060705 154441 Processing linksByMD5: Sorted Infinity instructions/
> > second
> > 060705 154441 Processing linksByMD5: Merged to new DB containing
> > 148 records
> > in 0.016 seconds
> > 060705 154441 Processing linksByMD5: Merged 9250.0 records/second
> > 060705 154442 Update finished
> > 060705 154442 FetchListTool started
> > 060705 154442 Processing pagesByURL: Sorted 26 instructions in
> > 0.016seconds.
> > 060705 154442 Processing pagesByURL: Sorted 1625.0 instructions/second
> > 060705 154442 Processing pagesByURL: Merged to new DB containing 53
> > records
> > in 0.015 seconds
> > 060705 154442 Processing pagesByURL: Merged
> > 3533.3333333333335records/second
> > 060705 154442 Processing pagesByMD5: Sorted 26 instructions in 0.0
> > seconds.
> > 060705 154442 Processing pagesByMD5: Sorted Infinity instructions/
> > second
> > 060705 154442 Processing pagesByMD5: Merged to new DB containing 53
> > records
> > in 0.0 seconds
> > 060705 154442 Processing pagesByMD5: Merged Infinity records/second
> > 060705 154442 Processing linksByMD5: Copied file (0 bytes) in 0.016
> > secs.
> > 060705 154442 Processing linksByURL: Copied file (0 bytes) in 0.0
> > secs.
> > 060705 154442 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted:
> > Sorted 26
> > entries in 0.093 seconds.
> > 060705 154442 Processing C:\cygwin\nutch-
> > 0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted
> > 279.5698924731183 entries/second
> > 060705 154442 Overall processing: Sorted 26 entries in 0.093 seconds.
> > 060705 154442 Overall processing: Sorted 0.003576923076923077
> > entries/second
> > 060705 154443 FetchListTool completed
> > 060705 154443 logging at INFO
> > 060705 154443 fetching http://www.haha365.com/gd_joke/20050815111532.htm
> > 060705 154443 fetching http://www.haha365.com/gd_joke/20050815105800.htm
> > 060705 154443 fetching http://www.haha365.com/gd_joke/20050319085605.htm
> > 060705 154443 fetching http://www.haha365.com/gd_joke/20050815110121.htm
> > 060705 154446 fetching http://www.haha365.com/gd_joke/20060625064748.htm
> > 060705 154448 fetching http://www.haha365.com/gd_joke/20050815105937.htm
> > 060705 154449 fetching http://www.haha365.com/gd_joke/20050815110925.htm
> > 060705 154450 fetching http://www.haha365.com/gd_joke/20050815111651.htm
> > 060705 154452 fetching http://www.haha365.com/gd_joke/20050706110014.htm
> > 060705 154453 fetching http://www.haha365.com/gd_joke/20050318163615.htm
> > 060705 154454 fetching http://www.haha365.com/gd_joke/20050815111228.htm
> > 060705 154456 fetching http://www.haha365.com/gd_joke/20050706105833.htm
> > 060705 154457 fetching http://www.haha365.com/gd_joke/20050815110411.htm
> > 060705 154459 fetching http://www.haha365.com/gd_joke/20050815105527.htm
> > 060705 154500 fetching http://www.haha365.com/gd_joke/20050815111758.htm
> > 060705 154502 fetching http://www.haha365.com/gd_joke/20050706110230.htm
> > 060705 154503 fetching http://www.haha365.com/gd_joke/20050706105453.htm
> > 060705 154504 fetching http://www.haha365.com/gd_joke/20050706110522.htm
> > 060705 154506 fetching http://www.haha365.com/gd_joke/20050706105104.htm
> > 060705 154507 fetching http://www.haha365.com/gd_joke/20050709144044.htm
> > 060705 154509 fetching http://www.haha365.com/gd_joke/20060611112617.htm
> > 060705 154510 fetching http://www.haha365.com/gd_joke/20050815105330.htm
> > 060705 154511 fetching http://www.haha365.com/gd_joke/20050709144708.htm
> > 060705 154513 fetching http://www.haha365.com/gd_joke/20050706105324.htm
> > 060705 154514 fetching http://www.haha365.com/gd_joke/20050815110707.htm
> > 060705 154516 fetching http://www.haha365.com/gd_joke/20050706105218.htm
> > 060705 154523 status: segment 20060705154442, 26 pages, 0 errors, 314308 bytes, 40063 ms
> > 060705 154523 status: 0.6489779 pages/s, 61.291748 kb/s, 12088.77 bytes/page
> > 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> > 060705 154524 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> > 060705 154524 Processing document 0
> > 060705 154524 Finishing update
> > 060705 154524 Processing pagesByURL: Sorted 127 instructions in 0.0 seconds.
> > 060705 154524 Processing pagesByURL: Sorted Infinity instructions/second
> > 060705 154524 Processing pagesByURL: Merged to new DB containing 56 records in 0.0 seconds
> > 060705 154524 Processing pagesByURL: Merged Infinity records/second
> > 060705 154524 Processing pagesByMD5: Sorted 55 instructions in 0.016 seconds.
> > 060705 154524 Processing pagesByMD5: Sorted 3437.5 instructions/second
> > 060705 154524 Processing pagesByMD5: Merged to new DB containing 56 records in 0.015 seconds
> > 060705 154524 Processing pagesByMD5: Merged 3733.3333333333335 records/second
> > 060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.016 seconds.
> > 060705 154524 Processing linksByMD5: Sorted 7937.5 instructions/second
> > 060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
> > 060705 154524 Processing linksByMD5: Merged Infinity records/second
> > 060705 154524 Processing linksByURL: Sorted 101 instructions in 0.0 seconds.
> > 060705 154524 Processing linksByURL: Sorted Infinity instructions/second
> > 060705 154524 Processing linksByURL: Merged to new DB containing 249 records in 0.016 seconds
> > 060705 154524 Processing linksByURL: Merged 15562.5 records/second
> > 060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.015 seconds.
> > 060705 154524 Processing linksByMD5: Sorted 8466.666666666668 instructions/second
> > 060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
> > 060705 154524 Processing linksByMD5: Merged Infinity records/second
> > 060705 154524 Update finished
> > 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
> > 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> > 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> > 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> > 060705 154524 Sorting pages by url...
> > 060705 154524 Getting updated scores and anchors from db...
> > 060705 154524 Sorting updates by segment...
> > 060705 154524 Updating segments...
> > 060705 154524  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> > 060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> > 060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> > 060705 154525 Done updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
> > 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> > 060705 154525 * Opening segment 20060705154333
> > 060705 154525 * Indexing segment 20060705154333
> > 060705 154525 found resource common-terms.utf8 at file:/C:/cygwin/nutch-0.7.2/conf/common-terms.utf8
> > 060705 154525 * Optimizing index...
> > 060705 154525 * Moving index to NFS if needed...
> > 060705 154525 DONE indexing segment 20060705154333: total 1 records in 0.187 s (Infinity rec/s).
> > 060705 154525 done indexing
> > 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> > 060705 154525 * Opening segment 20060705154337
> > 060705 154525 * Indexing segment 20060705154337
> > 060705 154525 * Optimizing index...
> > 060705 154525 * Moving index to NFS if needed...
> > 060705 154525 DONE indexing segment 20060705154337: total 26 records in 0.391 s (Infinity rec/s).
> > 060705 154525 done indexing
> > 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> > 060705 154525 * Opening segment 20060705154442
> > 060705 154525 * Indexing segment 20060705154442
> > 060705 154525 * Optimizing index...
> > 060705 154525 * Moving index to NFS if needed...
> > 060705 154525 DONE indexing segment 20060705154442: total 26 records in 0.219 s (Infinity rec/s).
> > 060705 154525 done indexing
> > 060705 154526 Reading url hashes...
> > 060705 154526 Sorting url hashes...
> > 060705 154526 Deleting url duplicates...
> > 060705 154526 Deleted 0 url duplicates.
> > 060705 154526 Reading content hashes...
> > 060705 154526 Sorting content hashes...
> > 060705 154526 Deleting content duplicates...
> > 060705 154526 Deleted 1 content duplicates.
> > 060705 154526 Duplicate deletion complete locally.  Now returning to NFS...
> > 060705 154526 DeleteDuplicates complete
> > 060705 154526 Merging segment indexes...
> > 060705 154526 crawl finished: crawled2
>
>




______________________________________
Tonal web design and hosting
http://tonalweb.com
eCommerce development & marketing





Re: why i can't crawl all the linked pages in the specified page to crawl.

Posted by kevin <ke...@gmail.com>.
Hi Stefan,
Thanks for your reply.
I tried a depth of 20 and it works better: it crawls almost all of the
pages. However, it still has not crawled every page.
I'll try a larger depth, such as 30, later.
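
One way to see why a larger depth helps: each crawl round only fetches pages that earlier rounds discovered, so on a paginated listing, where every index page links to the next one, each extra round reaches roughly one more index page plus its articles. Below is a minimal Python sketch of that effect. It is not Nutch code, and the site is made up: the page names, the 10 index pages and the 25 articles per index page are assumptions, picked only so that the first rounds resemble the 1, 26, 26 pages seen in the log.

```python
def build_toy_site(index_pages=10, articles_per_index=25):
    """Hypothetical paginated site: index_i links to its articles and to index_(i+1)."""
    links = {}
    for i in range(index_pages):
        articles = [f"article_{i}_{j}" for j in range(articles_per_index)]
        for article in articles:
            links[article] = []          # articles have no outlinks in this toy model
        out = list(articles)
        if i + 1 < index_pages:
            out.append(f"index_{i + 1}")  # "next page" link
        links[f"index_{i}"] = out
    return links


def crawl(links, seed, depth):
    """Run up to `depth` generate/fetch/update rounds; return pages fetched per round."""
    known = {seed}
    fetched = set()
    per_round = []
    for _ in range(depth):
        fetch_list = sorted(known - fetched)  # discovered but not yet fetched
        if not fetch_list:
            break                             # nothing new left to fetch
        for page in fetch_list:
            fetched.add(page)
            known.update(links[page])         # outlinks become candidates for the next round
        per_round.append(len(fetch_list))
    return per_round


if __name__ == "__main__":
    site = build_toy_site()
    print(crawl(site, "index_0", 3))        # [1, 26, 26]
    print(sum(crawl(site, "index_0", 20)))  # 260, i.e. every page of the toy site
```

In this toy model, depth 3 stops after 53 pages even though nothing is filtered or blocked; the remaining pages are simply further away from the seed than three rounds can reach.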



Stefan Groschupf wrote:
> Hi,
> Maybe you can try a much higher depth, something like 20?
> However, in general, check:
> + the regex URL filter file
> + the robots.txt
> + nofollow tags in the pages
> + the number of outlinks to extract in nutch-default.xml
>
> Stefan
> On 06.07.2006, at 19:12, kevin pang wrote:
>
>> i set up the nutch to crawl the url: http://www.haha365.com/gd_joke/
>> but after crawl complete, only 54 pages were fetched.
>
>
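
To check how many pages each round actually fetched, the "status: segment ..." entries of the crawl log quoted in this thread can be tallied. Here is a small Python sketch; the function name is made up, and the pattern only assumes the status-line wording that appears in the log. It also undoes mail quoting and line wrapping, since the copies of the log in the replies are wrapped.

```python
import re

# Matches e.g. "status: segment 20060705154337, 26 pages, 0 errors, 323050 bytes"
STATUS = re.compile(
    r"status: segment (\d+), (\d+) pages, (\d+) errors,\s+(\d+)\s+bytes"
)


def pages_per_segment(log_text):
    """Return {segment_name: (pages, errors)} for every status line in the log."""
    # Drop mail quote markers ("> ", ">> ", "> > ") and join wrapped lines.
    lines = [re.sub(r"^(?:>\s?)+", "", line) for line in log_text.splitlines()]
    flat = " ".join(line.strip() for line in lines)
    result = {}
    for segment, pages, errors, _size in STATUS.findall(flat):
        result[segment] = (int(pages), int(errors))
    return result


if __name__ == "__main__":
    sample = """\
>> 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172
>> bytes, 2000 ms
>> 060705 154440 status: segment 20060705154337, 26 pages, 0 errors, 323050
>> bytes, 62047 ms
>> 060705 154523 status: segment 20060705154442, 26 pages, 0 errors, 314308
>> bytes, 40063 ms
"""
    summary = pages_per_segment(sample)
    for segment, (pages, errors) in sorted(summary.items()):
        print(segment, pages, errors)
    print("total pages:", sum(p for p, _ in summary.values()))
```

For the log in this thread that is 1, 26 and 26 pages with 0 errors, 53 in total, which suggests the crawl ended because the three rounds allowed by depth = 3 were used up, not because fetches failed.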


Re: why i can't crawl all the linked pages in the specified page to crawl.

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi,
Maybe you can try a much higher depth, something like 20?
However, in general, check:
+ the regex URL filter file
+ the robots.txt
+ nofollow tags in the pages
+ the number of outlinks to extract in nutch-default.xml
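
The first two checks can be tried outside Nutch. The sketch below is plain Python, not Nutch code. `url_allowed` imitates how a +/- regex rule file is read as far as I understand it: rules are tried in order, the first pattern found in the URL decides, and a URL that matches no rule is rejected. `robots_allows` uses the standard-library `urllib.robotparser`. The rule set and the robots.txt content are made-up examples, not the real files for this site.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical rules in the "+regex" / "-regex" style of a crawl URL filter file.
EXAMPLE_RULES = r"""
# skip URLs that look like queries
-[?*!@=]
# accept anything under the site being crawled
+^http://([a-z0-9]*\.)*haha365\.com/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn rule text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # blank line or comment
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_allowed(rules, url):
    """First rule whose pattern is found in the URL decides; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


def robots_allows(robots_txt, agent, url):
    """Check a URL against robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    rules = parse_rules(EXAMPLE_RULES)
    for url in ("http://www.haha365.com/gd_joke/index_3.htm",
                "http://www.haha365.com/list.asp?page=2"):
        print(url, url_allowed(rules, url))
```

With rules like these, any link containing ?, *, !, @ or = is dropped before it is ever fetched, so a listing that pages through query-string URLs would be cut short no matter how large the depth is. For the last item in the list, I believe the relevant property is db.max.outlinks.per.page, but that name is from memory, so please confirm it in your own nutch-default.xml.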

Stefan
On 06.07.2006, at 19:12, kevin pang wrote:

> i set up the nutch to crawl the url: http://www.haha365.com/gd_joke/
> but after crawl complete, only 54 pages were fetched.
>
> here is the log info:
>
> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-default.xml
> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/crawl-tool.xml
> 060705 154332 parsing file:/C:/cygwin/nutch-0.7.2/conf/nutch-site.xml
> 060705 154332 No FS indicated, using default:local
> 060705 154332 crawl started in: crawled2
> 060705 154332 rootUrlFile = url.txt
> 060705 154332 threads = 4
> 060705 154332 depth = 3
> 060705 154333 Created webdb at LocalFS,C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154333 Starting URL processing
> 060705 154333 Plugins: looking in: C:\cygwin\nutch-0.7.2\plugins
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\urlfilter-regex\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\urlfilter-prefix
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-url\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-site\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\query-more
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\query-basic\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-httpclient
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\protocol-http\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-ftp
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\protocol-file
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-text\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-rss
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-pdf
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-msword
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-js
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\parse-html\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\parse-ext
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\ontology
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\nutch-extensionpoints\plugin.xml
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\language-identifier
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\index-more
> 060705 154333 parsing: C:\cygwin\nutch-0.7.2\plugins\index-basic\plugin.xml
> 060705 154333 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\creativecommons
> 060705 154333 not including: C:\cygwin\nutch-0.7.2\plugins\clustering-carrot2
> 060705 154333 found resource crawl-urlfilter.txt at file:/C:/cygwin/nutch-0.7.2/conf/crawl-urlfilter.txt
> 060705 154333 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060705 154333 Added 1 pages
> 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.016 seconds.
> 060705 154333 Processing pagesByURL: Sorted 62.5 instructions/second
> 060705 154333 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
> 060705 154333 Processing pagesByURL: Merged Infinity records/second
> 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
> 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/second
> 060705 154333 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
> 060705 154333 Processing pagesByMD5: Merged Infinity records/second
> 060705 154333 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
> 060705 154333 Processing linksByURL: Copied file (0 bytes) in 0.015 secs.
> 060705 154333 FetchListTool started
> 060705 154333 Processing pagesByURL: Sorted 1 instructions in 0.0 seconds.
> 060705 154333 Processing pagesByURL: Sorted Infinity instructions/second
> 060705 154333 Processing pagesByURL: Merged to new DB containing 1 records in 0.0 seconds
> 060705 154333 Processing pagesByURL: Merged Infinity records/second
> 060705 154333 Processing pagesByMD5: Sorted 1 instructions in 0.0 seconds.
> 060705 154333 Processing pagesByMD5: Sorted Infinity instructions/second
> 060705 154334 Processing pagesByMD5: Merged to new DB containing 1 records in 0.0 seconds
> 060705 154334 Processing pagesByMD5: Merged Infinity records/second
> 060705 154334 Processing linksByMD5: Copied file (0 bytes) in 0.031 secs.
> 060705 154334 Processing linksByURL: Copied file (0 bytes) in 0.015 secs.
> 060705 154334 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted 1 entries in 0.015 seconds.
> 060705 154334 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333\fetchlist.unsorted: Sorted 66.66666666666667 entries/second
> 060705 154334 Overall processing: Sorted 1 entries in 0.015 seconds.
> 060705 154334 Overall processing: Sorted 0.015 entries/second
> 060705 154334 FetchListTool completed
> 060705 154334 logging at INFO
> 060705 154334 fetching http://www.haha365.com/gd_joke/index_3.htm
> 060705 154334 http.proxy.host = null
> 060705 154334 http.proxy.port = 8080
> 060705 154334 http.timeout = 10000
> 060705 154334 http.content.limit = 65536
> 060705 154334 http.agent = NutchCVS/0.7.2 (Nutch; http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060705 154334 fetcher.server.delay = 1000
> 060705 154334 http.max.delays = 100
> 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172 bytes, 2000 ms
> 060705 154336 status: 0.5 pages/s, 74.890625 kb/s, 19172.0 bytes/page
> 060705 154337 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154337 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> 060705 154337 Processing document 0
> 060705 154337 Finishing update
> 060705 154337 Processing pagesByURL: Sorted 27 instructions in 0.015 seconds.
> 060705 154337 Processing pagesByURL: Sorted 1800.0 instructions/second
> 060705 154337 Processing pagesByURL: Merged to new DB containing 27 records in 0.0 seconds
> 060705 154337 Processing pagesByURL: Merged Infinity records/second
> 060705 154337 Processing pagesByMD5: Sorted 28 instructions in 0.015 seconds.
> 060705 154337 Processing pagesByMD5: Sorted 1866.6666666666667 instructions/second
> 060705 154337 Processing pagesByMD5: Merged to new DB containing 27 records in 0.016 seconds
> 060705 154337 Processing pagesByMD5: Merged 1687.5 records/second
> 060705 154337 Processing linksByMD5: Sorted 27 instructions in 0.015 seconds.
> 060705 154337 Processing linksByMD5: Sorted 1800.0 instructions/second
> 060705 154337 Processing linksByMD5: Merged to new DB containing 26 records in 0.0 seconds
> 060705 154337 Processing linksByMD5: Merged Infinity records/second
> 060705 154337 Processing linksByURL: Sorted 26 instructions in 0.015 seconds.
> 060705 154337 Processing linksByURL: Sorted 1733.3333333333335 instructions/second
> 060705 154337 Processing linksByURL: Merged to new DB containing 26 records in 0.0 seconds
> 060705 154337 Processing linksByURL: Merged Infinity records/second
> 060705 154337 Processing linksByMD5: Sorted 26 instructions in 0.031 seconds.
> 060705 154337 Processing linksByMD5: Sorted 838.7096774193549 instructions/second
> 060705 154337 Processing linksByMD5: Merged to new DB containing 26 records in 0.0 seconds
> 060705 154337 Processing linksByMD5: Merged Infinity records/second
> 060705 154337 Update finished
> 060705 154337 FetchListTool started
> 060705 154338 Processing pagesByURL: Sorted 26 instructions in 0.016 seconds.
> 060705 154338 Processing pagesByURL: Sorted 1625.0 instructions/second
> 060705 154338 Processing pagesByURL: Merged to new DB containing 27 records in 0.0 seconds
> 060705 154338 Processing pagesByURL: Merged Infinity records/second
> 060705 154338 Processing pagesByMD5: Sorted 26 instructions in 0.0 seconds.
> 060705 154338 Processing pagesByMD5: Sorted Infinity instructions/second
> 060705 154338 Processing pagesByMD5: Merged to new DB containing 27 records in 0.015 seconds
> 060705 154338 Processing pagesByMD5: Merged 1800.0 records/second
> 060705 154338 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
> 060705 154338 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
> 060705 154338 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted 26 entries in 0.0 seconds.
> 060705 154338 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337\fetchlist.unsorted: Sorted Infinity entries/second
> 060705 154338 Overall processing: Sorted 26 entries in 0.0 seconds.
> 060705 154338 Overall processing: Sorted 0.0 entries/second
> 060705 154338 FetchListTool completed
> 060705 154338 logging at INFO
> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084431.htm
> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084733.htm
> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319085110.htm
> 060705 154338 fetching http://www.haha365.com/gd_joke/20050319084338.htm
> 060705 154339 fetching http://www.haha365.com/gd_joke/20050319085226.htm
> 060705 154340 fetching http://www.haha365.com/gd_joke/20050318163740.htm
> 060705 154341 fetching http://www.haha365.com/gd_joke/20050319085344.htm
> 060705 154343 fetching http://www.haha365.com/gd_joke/20050318163709.htm
> 060705 154345 fetching http://www.haha365.com/gd_joke/20050319085310.htm
> 060705 154347 fetching http://www.haha365.com/gd_joke/20050319085028.htm
> 060705 154349 fetching http://www.haha365.com/gd_joke/20050319084052.htm
> 060705 154350 fetching http://www.haha365.com/gd_joke/index.htm
> 060705 154352 fetching http://www.haha365.com/gd_joke/20050319084902.htm
> 060705 154353 fetching http://www.haha365.com/gd_joke/20050319084945.htm
> 060705 154355 fetching http://www.haha365.com/gd_joke/20050319084129.htm
> 060705 154356 fetching http://www.haha365.com/gd_joke/20050319084202.htm
> 060705 154358 fetching http://www.haha365.com/gd_joke/20050318163642.htm
> 060705 154359 fetching http://www.haha365.com/gd_joke/20050319084304.htm
> 060705 154400 fetching http://www.haha365.com/gd_joke/20050319084822.htm
> 060705 154402 fetching http://www.haha365.com/gd_joke/20050319085142.htm
> 060705 154403 fetching http://www.haha365.com/gd_joke/20050319084232.htm
> 060705 154408 fetching http://www.haha365.com/gd_joke/20050318163829.htm
> 060705 154411 fetching http://www.haha365.com/gd_joke/20050318163920.htm
> 060705 154415 fetching http://www.haha365.com/gd_joke/20050319084559.htm
> 060705 154419 fetching http://www.haha365.com/gd_joke/
> 060705 154423 fetching http://www.haha365.com/gd_joke/20050318163807.htm
> 060705 154440 status: segment 20060705154337, 26 pages, 0 errors, 323050 bytes, 62047 ms
> 060705 154440 status: 0.41903716 pages/s, 40.67607 kb/s, 12425.0 bytes/page
> 060705 154441 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154441 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> 060705 154441 Processing document 0
> 060705 154441 Finishing update
> 060705 154441 Processing pagesByURL: Sorted 174 instructions in 0.016 seconds.
> 060705 154441 Processing pagesByURL: Sorted 10875.0 instructions/second
> 060705 154441 Processing pagesByURL: Merged to new DB containing 53 records in 0.0 seconds
> 060705 154441 Processing pagesByURL: Merged Infinity records/second
> 060705 154441 Processing pagesByMD5: Sorted 78 instructions in 0.015 seconds.
> 060705 154441 Processing pagesByMD5: Sorted 5200.0 instructions/second
> 060705 154441 Processing pagesByMD5: Merged to new DB containing 53 records in 0.0 seconds
> 060705 154441 Processing pagesByMD5: Merged Infinity records/second
> 060705 154441 Processing linksByMD5: Sorted 174 instructions in 0.016 seconds.
> 060705 154441 Processing linksByMD5: Sorted 10875.0 instructions/second
> 060705 154441 Processing linksByMD5: Merged to new DB containing 148 records in 0.015 seconds
> 060705 154441 Processing linksByMD5: Merged 9866.666666666668 records/second
> 060705 154441 Processing linksByURL: Sorted 122 instructions in 0.0 seconds.
> 060705 154441 Processing linksByURL: Sorted Infinity instructions/second
> 060705 154441 Processing linksByURL: Merged to new DB containing 148 records in 0.015 seconds
> 060705 154441 Processing linksByURL: Merged 9866.666666666668 records/second
> 060705 154441 Processing linksByMD5: Sorted 148 instructions in 0.0 seconds.
> 060705 154441 Processing linksByMD5: Sorted Infinity instructions/second
> 060705 154441 Processing linksByMD5: Merged to new DB containing 148 records in 0.016 seconds
> 060705 154441 Processing linksByMD5: Merged 9250.0 records/second
> 060705 154442 Update finished
> 060705 154442 FetchListTool started
> 060705 154442 Processing pagesByURL: Sorted 26 instructions in 0.016 seconds.
> 060705 154442 Processing pagesByURL: Sorted 1625.0 instructions/second
> 060705 154442 Processing pagesByURL: Merged to new DB containing 53 records in 0.015 seconds
> 060705 154442 Processing pagesByURL: Merged 3533.3333333333335 records/second
> 060705 154442 Processing pagesByMD5: Sorted 26 instructions in 0.0 seconds.
> 060705 154442 Processing pagesByMD5: Sorted Infinity instructions/second
> 060705 154442 Processing pagesByMD5: Merged to new DB containing 53 records in 0.0 seconds
> 060705 154442 Processing pagesByMD5: Merged Infinity records/second
> 060705 154442 Processing linksByMD5: Copied file (0 bytes) in 0.016 secs.
> 060705 154442 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
> 060705 154442 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted 26 entries in 0.093 seconds.
> 060705 154442 Processing C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442\fetchlist.unsorted: Sorted 279.5698924731183 entries/second
> 060705 154442 Overall processing: Sorted 26 entries in 0.093 seconds.
> 060705 154442 Overall processing: Sorted 0.003576923076923077 entries/second
> 060705 154443 FetchListTool completed
> 060705 154443 logging at INFO
> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815111532.htm
> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815105800.htm
> 060705 154443 fetching http://www.haha365.com/gd_joke/20050319085605.htm
> 060705 154443 fetching http://www.haha365.com/gd_joke/20050815110121.htm
> 060705 154446 fetching http://www.haha365.com/gd_joke/20060625064748.htm
> 060705 154448 fetching http://www.haha365.com/gd_joke/20050815105937.htm
> 060705 154449 fetching http://www.haha365.com/gd_joke/20050815110925.htm
> 060705 154450 fetching http://www.haha365.com/gd_joke/20050815111651.htm
> 060705 154452 fetching http://www.haha365.com/gd_joke/20050706110014.htm
> 060705 154453 fetching http://www.haha365.com/gd_joke/20050318163615.htm
> 060705 154454 fetching http://www.haha365.com/gd_joke/20050815111228.htm
> 060705 154456 fetching http://www.haha365.com/gd_joke/20050706105833.htm
> 060705 154457 fetching http://www.haha365.com/gd_joke/20050815110411.htm
> 060705 154459 fetching http://www.haha365.com/gd_joke/20050815105527.htm
> 060705 154500 fetching http://www.haha365.com/gd_joke/20050815111758.htm
> 060705 154502 fetching http://www.haha365.com/gd_joke/20050706110230.htm
> 060705 154503 fetching http://www.haha365.com/gd_joke/20050706105453.htm
> 060705 154504 fetching http://www.haha365.com/gd_joke/20050706110522.htm
> 060705 154506 fetching http://www.haha365.com/gd_joke/20050706105104.htm
> 060705 154507 fetching http://www.haha365.com/gd_joke/20050709144044.htm
> 060705 154509 fetching http://www.haha365.com/gd_joke/20060611112617.htm
> 060705 154510 fetching http://www.haha365.com/gd_joke/20050815105330.htm
> 060705 154511 fetching http://www.haha365.com/gd_joke/20050709144708.htm
> 060705 154513 fetching http://www.haha365.com/gd_joke/20050706105324.htm
> 060705 154514 fetching http://www.haha365.com/gd_joke/20050815110707.htm
> 060705 154516 fetching http://www.haha365.com/gd_joke/20050706105218.htm
> 060705 154523 status: segment 20060705154442, 26 pages, 0 errors, 314308 bytes, 40063 ms
> 060705 154523 status: 0.6489779 pages/s, 61.291748 kb/s, 12088.77 bytes/page
> 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154524 Updating for C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> 060705 154524 Processing document 0
> 060705 154524 Finishing update
> 060705 154524 Processing pagesByURL: Sorted 127 instructions in 0.0 seconds.
> 060705 154524 Processing pagesByURL: Sorted Infinity instructions/second
> 060705 154524 Processing pagesByURL: Merged to new DB containing 56 records in 0.0 seconds
> 060705 154524 Processing pagesByURL: Merged Infinity records/second
> 060705 154524 Processing pagesByMD5: Sorted 55 instructions in 0.016 seconds.
> 060705 154524 Processing pagesByMD5: Sorted 3437.5 instructions/second
> 060705 154524 Processing pagesByMD5: Merged to new DB containing 56 records in 0.015 seconds
> 060705 154524 Processing pagesByMD5: Merged 3733.3333333333335 records/second
> 060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.016 seconds.
> 060705 154524 Processing linksByMD5: Sorted 7937.5 instructions/second
> 060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
> 060705 154524 Processing linksByMD5: Merged Infinity records/second
> 060705 154524 Processing linksByURL: Sorted 101 instructions in 0.0 seconds.
> 060705 154524 Processing linksByURL: Sorted Infinity instructions/second
> 060705 154524 Processing linksByURL: Merged to new DB containing 249 records in 0.016 seconds
> 060705 154524 Processing linksByURL: Merged 15562.5 records/second
> 060705 154524 Processing linksByMD5: Sorted 127 instructions in 0.015 seconds.
> 060705 154524 Processing linksByMD5: Sorted 8466.666666666668 instructions/second
> 060705 154524 Processing linksByMD5: Merged to new DB containing 249 records in 0.0 seconds
> 060705 154524 Processing linksByMD5: Merged Infinity records/second
> 060705 154524 Update finished
> 060705 154524 Updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> 060705 154524  reading C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> 060705 154524 Sorting pages by url...
> 060705 154524 Getting updated scores and anchors from db...
> 060705 154524 Sorting updates by segment...
> 060705 154524 Updating segments...
> 060705 154524  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> 060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> 060705 154525  updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> 060705 154525 Done updating C:\cygwin\nutch-0.7.2\bin\crawled2\segments from C:\cygwin\nutch-0.7.2\bin\crawled2\db
> 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154333
> 060705 154525 * Opening segment 20060705154333
> 060705 154525 * Indexing segment 20060705154333
> 060705 154525 found resource common-terms.utf8 at file:/C:/cygwin/nutch-0.7.2/conf/common-terms.utf8
> 060705 154525 * Optimizing index...
> 060705 154525 * Moving index to NFS if needed...
> 060705 154525 DONE indexing segment 20060705154333: total 1 records in 0.187 s (Infinity rec/s).
> 060705 154525 done indexing
> 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154337
> 060705 154525 * Opening segment 20060705154337
> 060705 154525 * Indexing segment 20060705154337
> 060705 154525 * Optimizing index...
> 060705 154525 * Moving index to NFS if needed...
> 060705 154525 DONE indexing segment 20060705154337: total 26 records in 0.391 s (Infinity rec/s).
> 060705 154525 done indexing
> 060705 154525 indexing segment: C:\cygwin\nutch-0.7.2\bin\crawled2\segments\20060705154442
> 060705 154525 * Opening segment 20060705154442
> 060705 154525 * Indexing segment 20060705154442
> 060705 154525 * Optimizing index...
> 060705 154525 * Moving index to NFS if needed...
> 060705 154525 DONE indexing segment 20060705154442: total 26 records in 0.219 s (Infinity rec/s).
> 060705 154525 done indexing
> 060705 154526 Reading url hashes...
> 060705 154526 Sorting url hashes...
> 060705 154526 Deleting url duplicates...
> 060705 154526 Deleted 0 url duplicates.
> 060705 154526 Reading content hashes...
> 060705 154526 Sorting content hashes...
> 060705 154526 Deleting content duplicates...
> 060705 154526 Deleted 1 content duplicates.
> 060705 154526 Duplicate deletion complete locally.  Now returning to NFS...
> 060705 154526 DeleteDuplicates complete
> 060705 154526 Merging segment indexes...
> 060705 154526 crawl finished: crawled2
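
To see how many pages each round fetched, you can add up the fetcher "status: segment ..." lines from the log above. Here is a minimal sketch in Python, assuming the log has been saved as plain text (with or without the mail quote markers). The function name pages_per_segment is my own and is not part of Nutch.

```python
import re

# Matches fetcher summary lines from the log above, such as:
#   status: segment 20060705154337, 26 pages, 0 errors, 323050 bytes, 62047 ms
STATUS_RE = re.compile(r"status: segment (\d+), (\d+) pages, (\d+) errors")


def pages_per_segment(log_text):
    """Return {segment_name: (pages, errors)} from fetcher status lines."""
    # Drop mail quote markers and undo hard wrapping so wrapped lines still match.
    unquoted = re.sub(r"^>\s?", "", log_text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    return {
        seg: (int(pages), int(errors))
        for seg, pages, errors in STATUS_RE.findall(flat)
    }


# The three status lines copied from the quoted log (one per crawl round).
sample = """\
> 060705 154336 status: segment 20060705154333, 1 pages, 0 errors, 19172
> bytes, 2000 ms
> 060705 154440 status: segment 20060705154337, 26 pages, 0 errors,
> 323050
> bytes, 62047 ms
> 060705 154523 status: segment 20060705154442, 26 pages, 0 errors,
> 314308
> bytes, 40063 ms
"""

counts = pages_per_segment(sample)
total_pages = sum(pages for pages, _ in counts.values())
print(counts)
print("total fetched:", total_pages)
```

For this log the three segments report 1, 26 and 26 pages with 0 errors, one segment for each of the three rounds (depth = 3), while the last db update shows 56 records in the page db.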