Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/27 15:34:55 UTC
[Nutch Wiki] Update of "Incremental Crawling Scripts Test" by Gabriele Kahlout
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "Incremental Crawling Scripts Test" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test
--------------------------------------------------
New page:
2. Unabridged script output, with explanations, using nutch index:
{{{
$ ./whole-web-crawling-incremental urls-input/MR6 5 2
rm -r crawl
rm: urls-input/MR6/it_seeds: No such file or directory
2 urls to crawl
rm: urls-input/MR6/it_seeds/urls: No such file or directory
bin/nutch inject crawl/crawldb urls-input/MR6/it_seeds
Injector: starting at 2011-03-27 15:28:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls-input/MR6/it_seeds
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-03-27 15:28:22, elapsed: 00:00:15
generate-fetch-updatedb-invertlinks-index-merge iteration 0:
bin/nutch generate crawl/crawldb crawl/segments -topN 5
Generator: starting at 2011-03-27 15:28:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110327152839
Generator: finished at 2011-03-27 15:28:45, elapsed: 00:00:15
bin/nutch fetch crawl/segments/20110327152839
Fetcher: starting at 2011-03-27 15:28:49
Fetcher: segment: crawl/segments/20110327152839
Fetcher: threads: 10
QueueFeeder finished: total 2 records + hit by time limit :0
fetching http://localhost:8080/qui/2.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 1
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232536012
now = 1301232538470
0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232543848
now = 1301232539474
0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232543848
now = 1301232540479
0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232543848
now = 1301232541514
0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232543848
now = 1301232542619
0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
maxThreads = 1
inProgress = 0
crawlDelay = 5000
minCrawlDelay = 0
nextFetchTime = 1301232543848
now = 1301232543640
0. http://localhost:8080/qui/1.html
fetching http://localhost:8080/qui/1.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-27 15:29:07, elapsed: 00:00:17
bin/nutch updatedb crawl/crawldb crawl/segments/20110327152839
CrawlDb update: starting at 2011-03-27 15:29:12
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110327152839]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-03-27 15:29:22, elapsed: 00:00:09
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2011-03-27 15:29:27
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/simpatico/nutch-1.2/crawl/segments/20110327152839
LinkDb: finished at 2011-03-27 15:29:34, elapsed: 00:00:06
rm: crawl/new_indexes: No such file or directory
bin/nutch index crawl/new_indexes crawl/crawldb crawl/linkdb crawl/segments/20110327152839
Indexer: starting at 2011-03-27 15:29:39
content:4.0 while state.getLength():4 norm:0.25
host:1.0 while state.getLength():1 norm:1.0
site:1.0 while state.getLength():1 norm:1.0
title:1.0 while state.getLength():0 norm:1.0
url:7.0 while state.getLength():7 norm:0.14285715
content:4.0 while state.getLength():4 norm:0.25
host:1.0 while state.getLength():1 norm:1.0
site:1.0 while state.getLength():1 norm:1.0
title:1.0 while state.getLength():0 norm:1.0
url:7.0 while state.getLength():7 norm:0.14285715
Indexer: finished at 2011-03-27 15:29:57, elapsed: 00:00:18
bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
IndexMerger: starting at 2011-03-27 15:30:03
IndexMerger: merging indexes to: crawl/temp_indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-27 15:30:05, elapsed: 00:00:02
rm: crawl/indexes: No such file or directory
generate-fetch-updatedb-invertlinks-index-merge iteration 1:
bin/nutch generate crawl/crawldb crawl/segments -topN 5
Generator: starting at 2011-03-27 15:30:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 2
retry 0: 2
min score: 1.0
avg score: 1.0
max score: 1.0
status 2 (db_fetched): 2
CrawlDb statistics: done
bin/nutch mergedb crawl/temp_crawldb crawl/crawldb
CrawlDb merge: starting at 2011-03-27 15:30:37
Adding crawl/crawldb
CrawlDb merge: finished at 2011-03-27 15:30:44, elapsed: 00:00:07
rm: crawl/allcrawldb: No such file or directory
rm: crawl/allcrawldb/dump: No such file or directory
bin/nutch readdb crawl/allcrawldb -dump crawl/allcrawldb/dump
CrawlDb dump: starting
CrawlDb db: crawl/allcrawldb
CrawlDb dump: done
CrawlDb statistics start: crawl/allcrawldb
Statistics for CrawlDb: crawl/allcrawldb
TOTAL urls: 2
retry 0: 2
min score: 1.0
avg score: 1.0
max score: 1.0
status 2 (db_fetched): 2
CrawlDb statistics: done
}}}
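
The transcript above suggests the shape of the whole-web-crawling-incremental wrapper. The following is a minimal sketch reconstructed from the commands in the log, not the actual wiki script: the argument names, the segment-name handling, and the NUTCH dry-run default are assumptions. With NUTCH left at its default of 'echo bin/nutch' the function only prints the command sequence; setting NUTCH=bin/nutch inside a Nutch 1.2 checkout would execute it.

```shell
#!/bin/sh
# Dry-run sketch of the incremental crawl loop seen in the log above.
# Assumption-laden reconstruction; paths mirror the transcript.
crawl_incremental() {
    seed_dir="$1"    # e.g. urls-input/MR6
    topn="$2"        # e.g. 5 (passed to the generator as -topN)
    depth="$3"       # e.g. 2 iterations
    NUTCH="${NUTCH:-echo bin/nutch}"   # default: print commands only

    # Inject the seed urls into the CrawlDb.
    $NUTCH inject crawl/crawldb "$seed_dir/it_seeds"

    i=0
    while [ "$i" -lt "$depth" ]; do
        echo "generate-fetch-updatedb-invertlinks-index-merge iteration $i:"
        $NUTCH generate crawl/crawldb crawl/segments -topN "$topn"
        # The real script would capture the freshly generated segment name
        # (e.g. crawl/segments/20110327152839); a placeholder is used here.
        segment="crawl/segments/SEGMENT"
        $NUTCH fetch "$segment"
        $NUTCH updatedb crawl/crawldb "$segment"
        $NUTCH invertlinks crawl/linkdb -dir crawl/segments
        $NUTCH index crawl/new_indexes crawl/crawldb crawl/linkdb "$segment"
        $NUTCH merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
        i=$((i + 1))
    done

    # Report CrawlDb statistics, as at the end of the transcript.
    $NUTCH readdb crawl/crawldb -stats
}
```

Note how the transcript's iteration 1 stops after "0 records selected for fetching": the generator found nothing due for fetch, so in a real run the remaining steps of that iteration are skipped.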