Posted to user@nutch.apache.org by dominik81 <al...@gmx.net> on 2008/07/05 12:34:50 UTC

Nutch not indexing all fetched sites

Hi,

With nutch-2008-06-26_04-01-58 I'm trying to index a few pages from the
Microsoft support knowledge base. I put the URLs into a file called 'urlall'
inside my seed directory, which looks like this:


http://support.microsoft.com/kb/317507/en-us
http://support.microsoft.com/kb/295115/en-us
http://support.microsoft.com/kb/295117/en-us
http://support.microsoft.com/kb/840701/en-us
http://support.microsoft.com/kb/924611/en-us
http://support.microsoft.com/kb/158509/en-us
http://support.microsoft.com/kb/259258/en-us
http://support.microsoft.com/kb/287070/en-us
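(Before crawling, I first want to be sure each of these URLs actually returns
a 200; a quick spot-check from the shell, assuming curl is installed and the
seed file is plain text with one URL per line:)

for u in $(cat /Users/dominik/Documents/MastersThesis/nutch/urls/urlall); do
  # print the HTTP status next to each URL; redirects or server errors would show up here
  curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "$u"
done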

I want to index those 8 pages only. Now I run the following command to crawl
them (with -depth 1, only the injected seed URLs themselves should be fetched):

bin/nutch crawl /Users/dominik/Documents/MastersThesis/nutch/urls \
  -dir /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl \
  -depth 1 -topN 100 -threads 100
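
(If it helps with debugging: as far as I understand, the crawl command just
chains the individual tools, so I could also run the steps one at a time and
see where documents get lost. The exact arguments below follow the old 0.8/0.9
tutorial, so treat them as a sketch; paths are relative to my Nutch directory:)

cd /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58
bin/nutch inject crawl/crawldb /Users/dominik/Documents/MastersThesis/nutch/urls
bin/nutch generate crawl/crawldb crawl/segments -topN 100
s=`ls -d crawl/segments/* | tail -1`    # the segment just generated
bin/nutch fetch $s
bin/nutch updatedb crawl/crawldb $s
bin/nutch invertlinks crawl/linkdb $s
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
bin/nutch dedup crawl/indexes
bin/nutch merge crawl/index crawl/indexes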

When the crawl finishes, only 5 of the 8 pages are indexed. Can you tell me
why, or what I need to change so that all of the pages from 'urlall' get indexed?
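
One thing I notice in the injector output below: it skips several lines such
as {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330 with a
MalformedURLException, and the fetched URLs all end in a stray \ or }. Those
look like RTF control words, so I suspect TextEdit saved 'urlall' as RTF
instead of plain text. This is how I'd check and convert the file (assuming it
lives in the 'urls' seed directory; textutil is the standard macOS converter
and writes urlall.txt next to the original):

# "Rich Text Format data" here, rather than "ASCII text", would confirm the suspicion
file /Users/dominik/Documents/MastersThesis/nutch/urls/urlall

# convert to plain text; this creates urlall.txt, which should replace the original
textutil -convert txt /Users/dominik/Documents/MastersThesis/nutch/urls/urlall

Does that sound plausible, or is something else going on?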

Thank you!


Here's the output from the crawl command:

crawl started in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
rootUrlDir = /Users/dominik/Documents/MastersThesis/nutch/urls
threads = 100
depth = 1
topN = 100
Injector: starting
Injector: crawlDb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
Injector: urlDir: /Users/dominik/Documents/MastersThesis/nutch/urls
Injector: Converting injected urls to crawl db entries.
Skipping {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330:java.net.MalformedURLException: no protocol: {\rtf1\ansi\ansicpg1252\cocoartf949\cocoasubrtf330
Skipping {\fonttbl\f0\fswiss\fcharset0 Helvetica;}:java.net.MalformedURLException: no protocol: {\fonttbl\f0\fswiss\fcharset0 Helvetica;}
Skipping {\colortbl;\red255\green255\blue255;}:java.net.MalformedURLException: no protocol: {\colortbl;\red255\green255\blue255;}
Skipping \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0:java.net.MalformedURLException: no protocol: \paperw11900\paperh16840\margl1440\margr1440\vieww9000\viewh8400\viewkind0
Skipping \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural:java.net.MalformedURLException: no protocol: \pard\tx566\tx1133\tx1700\tx2267\tx2834\tx3401\tx3968\tx4535\tx5102\tx5669\tx6236\tx6803\ql\qnatural\pardirnatural
Skipping \f0\fs24 \cf0 \:java.net.MalformedURLException: no protocol: \f0\fs24 \cf0 \
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Generator: filtering: true
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
Fetcher: threads: 100
fetching http://support.microsoft.com/kb/259258/en-us\
fetching http://support.microsoft.com/kb/317507/en-us\
fetching http://support.microsoft.com/kb/295117/en-us\
fetching http://support.microsoft.com/kb/158509/en-us\
fetching http://support.microsoft.com/kb/295115/en-us\
fetching http://support.microsoft.com/kb/287070/en-us}
fetching http://support.microsoft.com/kb/840701/en-us\
fetching http://support.microsoft.com/kb/924611/en-us\
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/crawldb
CrawlDb update: segments: [/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
LinkDb: done
Indexer: starting
Indexer: linkdb: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/linkdb
Indexer: adding segment: file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/segments/20080705120652
IFD [Thread-151]: setInfoStream deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@79f5f7
IW 0 [Thread-151]: setInfoStream: dir=org.apache.lucene.store.FSDirectory@/private/tmp/hadoop-dominik/mapred/local/index/_-173514222 autoCommit=true mergePolicy=org.apache.lucene.index.LogByteSizeMergePolicy@596e13 mergeScheduler=org.apache.lucene.index.ConcurrentMergeScheduler@49d560 ramBufferSizeMB=16.0 maxBuffereDocs=50 maxBuffereDeleteTerms=-1 maxFieldLength=10000 index=
Indexing [http://support.microsoft.com/kb/287070/en-us}] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Indexing [http://support.microsoft.com/kb/295115/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Indexing [http://support.microsoft.com/kb/317507/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Indexing [http://support.microsoft.com/kb/840701/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Indexing [http://support.microsoft.com/kb/924611/en-us\] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@9f6a5e (null)
Optimizing index.
IW 0 [Thread-151]: optimize: index now
IW 0 [Thread-151]:   flush: segment=_0 docStoreSegment=_0 docStoreOffset=0 flushDocs=true flushDeletes=false flushDocStores=true numDocs=5 numBufDelTerms=0
IW 0 [Thread-151]:   index before flush
flush postings as segment _0 numDocs=5
closeDocStore: 2 files to flush to segment _0
oldRAMSize=76608 newFlushedSize=30888 docs/MB=169.738 new/old=40.32%
IW 0 [Thread-151]: checkpoint: wrote segments file "segments_2"
IFD [Thread-151]: now checkpoint "segments_2" [1 segments ; isCommit = true]
IFD [Thread-151]: deleteCommits: now remove commit "segments_1"
IFD [Thread-151]: delete "segments_1"
IW 0 [Thread-151]: LMP: findMerges: 1 segments
IW 0 [Thread-151]: LMP:   level -1.0 to 2.6517506: 1 segments
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS:   index: _0:C5
IW 0 [Thread-151]: CMS:   no more merges pending; now return
IW 0 [Thread-151]: CMS: now merge
IW 0 [Thread-151]: CMS:   index: _0:C5
IW 0 [Thread-151]: CMS:   no more merges pending; now return
IW 0 [Thread-151]: now flush at close
IW 0 [Thread-151]:   flush: segment=null docStoreSegment=null docStoreOffset=0 flushDocs=false flushDeletes=false flushDocStores=false numDocs=0 numBufDelTerms=0
IW 0 [Thread-151]:   index before flush _0:C5
IW 0 [Thread-151]: at close: _0:C5
Indexer: done
Dedup: starting
Dedup: adding indexes in: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes
Dedup: done
merging indexes to: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/index
Adding file:/Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl/indexes/part-00000
done merging
crawl finished: /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58/crawl
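
In case it helps: counting the "Indexing [...]" lines above, only five
documents made it into the index (numDocs=5);
http://support.microsoft.com/kb/259258/en-us,
http://support.microsoft.com/kb/295117/en-us and
http://support.microsoft.com/kb/158509/en-us are missing even though all
eight were fetched. To see what happened to them I'd dump the crawldb entries
(readdb ships with Nutch; options per the 0.9 docs, so a sketch):

cd /Users/dominik/ApplicationFolders/nutch-2008-06-26_04-01-58
# summary counts by status (db_fetched, db_unfetched, db_gone, ...)
bin/nutch readdb crawl/crawldb -stats
# full dump, one entry per URL, written to a directory for inspection
bin/nutch readdb crawl/crawldb -dump /tmp/crawldb-dump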


