Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/03/27 15:34:55 UTC

[Nutch Wiki] Update of "Incremental Crawling Scripts Test" by Gabriele Kahlout

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Incremental Crawling Scripts Test" page has been changed by Gabriele Kahlout.
http://wiki.apache.org/nutch/Incremental%20Crawling%20Scripts%20Test

--------------------------------------------------

New page:
2. Unabridged script run, with explanations, using nutch index:

{{{
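# Usage (inferred from the run below; the script itself is not shown):
#   ./whole-web-crawling-incremental <url-dir> <topN> <iterations>
# Here: seeds under urls-input/MR6, -topN 5 per generate round, and
# (presumably) 2 iterations.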
$ ./whole-web-crawling-incremental urls-input/MR6 5 2
rm -r crawl

rm: urls-input/MR6/it_seeds: No such file or directory
2 urls to crawl
rm: urls-input/MR6/it_seeds/urls: No such file or directory

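# inject: seed the crawldb with the URLs listed in the it_seeds directory.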
bin/nutch inject crawl/crawldb urls-input/MR6/it_seeds
Injector: starting at 2011-03-27 15:28:07
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls-input/MR6/it_seeds
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-03-27 15:28:22, elapsed: 00:00:15

generate-fetch-updatedb-invertlinks-index-merge iteration 0:

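# generate: select the top-scoring URLs due for fetching (at most -topN 5)
# and write them as a fetch list into a new segment under crawl/segments.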
bin/nutch generate crawl/crawldb crawl/segments -topN 5
Generator: starting at 2011-03-27 15:28:29
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: crawl/segments/20110327152839
Generator: finished at 2011-03-27 15:28:45, elapsed: 00:00:15

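# fetch: download the pages in the segment. The repeated queue dumps below
# are the fetcher threads spin-waiting out the 5000 ms per-host crawlDelay
# before the second request to localhost.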
bin/nutch fetch crawl/segments/20110327152839
Fetcher: starting at 2011-03-27 15:28:49
Fetcher: segment: crawl/segments/20110327152839
Fetcher: threads: 10
QueueFeeder finished: total 2 records + hit by time limit :0
fetching http://localhost:8080/qui/2.html
-activeThreads=10, spinWaiting=9, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232536012
  now           = 1301232538470
  0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232543848
  now           = 1301232539474
  0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232543848
  now           = 1301232540479
  0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232543848
  now           = 1301232541514
  0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232543848
  now           = 1301232542619
  0. http://localhost:8080/qui/1.html
-activeThreads=10, spinWaiting=10, fetchQueues.totalSize=1
* queue: http://localhost
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1301232543848
  now           = 1301232543640
  0. http://localhost:8080/qui/1.html
fetching http://localhost:8080/qui/1.html
-finishing thread FetcherThread, activeThreads=9
-finishing thread FetcherThread, activeThreads=7
-finishing thread FetcherThread, activeThreads=6
-finishing thread FetcherThread, activeThreads=8
-finishing thread FetcherThread, activeThreads=5
-finishing thread FetcherThread, activeThreads=4
-finishing thread FetcherThread, activeThreads=3
-finishing thread FetcherThread, activeThreads=2
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0
Fetcher: finished at 2011-03-27 15:29:07, elapsed: 00:00:17

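# updatedb: merge the fetch results (status, newly discovered links) from
# the segment back into the crawldb.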
bin/nutch updatedb crawl/crawldb crawl/segments/20110327152839
CrawlDb update: starting at 2011-03-27 15:29:12
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20110327152839]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
CrawlDb update: finished at 2011-03-27 15:29:22, elapsed: 00:00:09

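# invertlinks: invert the segments' outlinks into the linkdb, so each URL
# knows its incoming links and anchor texts.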
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
LinkDb: starting at 2011-03-27 15:29:27
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/simpatico/nutch-1.2/crawl/segments/20110327152839
LinkDb: finished at 2011-03-27 15:29:34, elapsed: 00:00:06

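# index: build a fresh Lucene index of the segment (using crawldb and
# linkdb) in crawl/new_indexes; the rm error just means there was no
# previous new_indexes to clear on this first run.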
rm: crawl/new_indexes: No such file or directory
bin/nutch index crawl/new_indexes crawl/crawldb crawl/linkdb crawl/segments/20110327152839
Indexer: starting at 2011-03-27 15:29:39
content:4.0 while state.getLength():4 norm:0.25
host:1.0 while state.getLength():1 norm:1.0
site:1.0 while state.getLength():1 norm:1.0
title:1.0 while state.getLength():0 norm:1.0
url:7.0 while state.getLength():7 norm:0.14285715
content:4.0 while state.getLength():4 norm:0.25
host:1.0 while state.getLength():1 norm:1.0
site:1.0 while state.getLength():1 norm:1.0
title:1.0 while state.getLength():0 norm:1.0
url:7.0 while state.getLength():7 norm:0.14285715
Indexer: finished at 2011-03-27 15:29:57, elapsed: 00:00:18

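# merge: combine the cumulative crawl/indexes with crawl/new_indexes into
# crawl/temp_indexes/part-1, which the script apparently then moves over
# crawl/indexes (hence the harmless rm error below on this first pass).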
bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
IndexMerger: starting at 2011-03-27 15:30:03
IndexMerger: merging indexes to: crawl/temp_indexes/part-1
Adding file:/Users/simpatico/nutch-1.2/crawl/new_indexes/part-00000
IndexMerger: finished at 2011-03-27 15:30:05, elapsed: 00:00:02

rm: crawl/indexes: No such file or directory

generate-fetch-updatedb-invertlinks-index-merge iteration 1:

bin/nutch generate crawl/crawldb crawl/segments -topN 5
Generator: starting at 2011-03-27 15:30:10
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: topN: 5
Generator: jobtracker is 'local', generating exactly one partition.
Generator: 0 records selected for fetching, exiting ...

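# readdb -stats: print crawldb statistics; both seed URLs are now db_fetched,
# which is why generate found nothing left to select above.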
bin/nutch readdb crawl/crawldb -stats
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls:	2
retry 0:	2
min score:	1.0
avg score:	1.0
max score:	1.0
status 2 (db_fetched):	2
CrawlDb statistics: done

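# mergedb: merge the per-run crawldb(s) (here just one) into
# crawl/temp_crawldb, which the script apparently then moves to
# crawl/allcrawldb (hence the rm error below).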
bin/nutch mergedb crawl/temp_crawldb crawl/crawldb
CrawlDb merge: starting at 2011-03-27 15:30:37
Adding crawl/crawldb
CrawlDb merge: finished at 2011-03-27 15:30:44, elapsed: 00:00:07

rm: crawl/allcrawldb: No such file or directory

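# readdb -dump: dump the merged crawldb as plain text for inspection,
# then print its statistics.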
rm: crawl/allcrawldb/dump: No such file or directory
bin/nutch readdb crawl/allcrawldb -dump crawl/allcrawldb/dump
CrawlDb dump: starting
CrawlDb db: crawl/allcrawldb
CrawlDb dump: done

CrawlDb statistics start: crawl/allcrawldb
Statistics for CrawlDb: crawl/allcrawldb
TOTAL urls:	2
retry 0:	2
min score:	1.0
avg score:	1.0
max score:	1.0
status 2 (db_fetched):	2
CrawlDb statistics: done
}}}
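The script itself is not reproduced on this page. A rough reconstruction from the commands and their order in the transcript above (the seed handling, the way the newest segment is picked, the temp_indexes/temp_crawldb moves, and the meaning of the third argument are all guesses) might look like:

{{{
#!/bin/bash
# whole-web-crawling-incremental <url-dir> <topN> <iterations>
# Reconstruction from the transcript, NOT the original script.
URL_DIR=$1      # e.g. urls-input/MR6
TOPN=$2         # passed through to: nutch generate -topN
ITERATIONS=$3   # assumed meaning of the third argument

rm -r crawl
rm -r "$URL_DIR/it_seeds"                 # harmless error on a fresh run
mkdir -p "$URL_DIR/it_seeds"
cp "$URL_DIR/urls" "$URL_DIR/it_seeds/urls"
echo "$(wc -l < "$URL_DIR/it_seeds/urls") urls to crawl"

bin/nutch inject crawl/crawldb "$URL_DIR/it_seeds"

PREV_SEGMENT=""
for ((i = 0; i < ITERATIONS; i++)); do
  echo "generate-fetch-updatedb-invertlinks-index-merge iteration $i:"
  bin/nutch generate crawl/crawldb crawl/segments -topN "$TOPN"
  SEGMENT=$(ls -d crawl/segments/* 2>/dev/null | sort | tail -1)
  # no new segment means generate selected 0 records, so stop iterating
  [ "$SEGMENT" = "$PREV_SEGMENT" ] && break
  PREV_SEGMENT=$SEGMENT
  bin/nutch fetch "$SEGMENT"
  bin/nutch updatedb crawl/crawldb "$SEGMENT"
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  rm -r crawl/new_indexes                 # harmless error on the first pass
  bin/nutch index crawl/new_indexes crawl/crawldb crawl/linkdb "$SEGMENT"
  bin/nutch merge crawl/temp_indexes/part-1 crawl/indexes crawl/new_indexes
  rm -r crawl/indexes                     # harmless error on the first pass
  mv crawl/temp_indexes crawl/indexes
done

bin/nutch readdb crawl/crawldb -stats

bin/nutch mergedb crawl/temp_crawldb crawl/crawldb
rm -r crawl/allcrawldb                    # harmless error on a fresh run
mv crawl/temp_crawldb crawl/allcrawldb

rm -r crawl/allcrawldb/dump
bin/nutch readdb crawl/allcrawldb -dump crawl/allcrawldb/dump
bin/nutch readdb crawl/allcrawldb -stats
}}}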