Posted to user@nutch.apache.org by Arun Kaundal <ar...@gmail.com> on 2005/12/26 13:17:07 UTC

Problem crawling MS Word, crawl depth issue, result problem

Hi,

Strangely, the crawl works fine when depth is 2 or less, but when depth is
greater than 2 it stops responding.

The second problem I am facing is with parsing MS Word files. I get an
error message saying the fetch is okay but the file cannot be parsed,
although I have specified this in the nutch-site.xml file:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|urlfilter-regex|parse-(xml|text|html|*doc*|pdf|js)|index-basic|query-(basic|site|url)</value>
</property>
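
One way to see which plugin ids a value like this admits is to test it as a regular expression. The sketch below is only an illustration under two assumptions: that Nutch treats plugin.includes as a regex that must match the whole plugin id, and that the asterisks around "doc" are mail-client emphasis rather than part of the value. The id parse-msword is the one named by the ParserFactory message in the crawl log; is_included is a hypothetical helper, not a Nutch API.

```python
import re

# Assumption: the asterisks around "doc" are mail-client emphasis, so the
# parse alternative is read here as plain "doc".
INCLUDES = (
    "protocol-file|urlfilter-regex|parse-(xml|text|html|doc|pdf|js)"
    "|index-basic|query-(basic|site|url)"
)


def is_included(plugin_id, includes=INCLUDES):
    """Return True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None


# Compare an id that is listed, the id the value spells out, and the id
# that the crawl log reports as not enabled.
for plugin_id in ("parse-pdf", "parse-doc", "parse-msword"):
    print(plugin_id, is_included(plugin_id))
```

If those assumptions hold, the value as written admits an id of parse-doc but not parse-msword, which would be consistent with the "not enabled via plugin.includes" message in the log.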

The third issue concerns search results. I parsed a PDF document (CMP.pdf)
and it parsed well. But when I search by keyword, I get only one result
per keyword search. Why is that? Is there a mistake somewhere?

The crawl log is below:

051226 173413 parsing file:/F:/Module_Index_Management/Atlantis_Tools/nutch-
default.xml
051226 173413 parsing file:/F:/Module_Index_Management/Atlantis_Tools/crawl-
tool.xml
051226 173413 parsing file:/F:/Module_Index_Management/Atlantis_Tools/nutch-
site.xml
051226 173413 No FS indicated, using default:local
051226 173413 crawl started in:
F:\Module_Index_Management\Atlantis_Tools\Crawled
051226 173413 rootUrlFile =
F:\Module_Index_Management\Atlantis_Tools\urls.txt
051226 173413 threads = 5
051226 173413 depth = 2
051226 173413 Created webdb at
LocalFS,F:\Module_Index_Management\Atlantis_Tools\Crawled\db
051226 173413 Starting URL processing
051226 173413 Plugins: looking in:
F:\Module_Index_Management\Atlantis_Tools\plugins
051226 173414 Plugin Auto-activation mode: [true]
051226 173414 Registered Plugins:
051226 173414   URL Query Filter (query-url)
051226 173414   Site Query Filter (query-site)
051226 173414   Html Parse Plug-in (parse-html)
051226 173414   the nutch core extension points (nutch-extensionpoints)
051226 173414   Basic Indexing Filter (index-basic)
051226 173414   Pdf Parse Plug-in (parse-pdf)
051226 173414   File Protocol Plug-in (protocol-file)
051226 173414   Text Parse Plug-in (parse-text)
051226 173414   JavaScript Parser (parse-js)
051226 173414   Regex URL Filter (urlfilter-regex)
051226 173414   Basic Query Filter (query-basic)
051226 173414 Registered Extension-Points:
051226 173414   Nutch Protocol (org.apache.nutch.protocol.Protocol)
051226 173414   Nutch URL Filter (org.apache.nutch.net.URLFilter)
051226 173414   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
051226 173414   Nutch Online Search Results Clustering Plugin (
org.apache.nutch.clustering.OnlineClusterer)
051226 173414   Nutch Indexing Filter (
org.apache.nutch.indexer.IndexingFilter)
051226 173414   Nutch Content Parser (org.apache.nutch.parse.Parser)
051226 173414   Ontology Model Loader (org.apache.nutch.ontology.Ontology)
051226 173414   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
051226 173414   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
051226 173414 found resource crawl-urlfilter.txt at
file:/F:/Module_Index_Management/Atlantis_Tools/crawl-urlfilter.txt
051226 173414 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051226 173414 Added 2 pages
051226 173414 Processing pagesByURL: Sorted 2 instructions in 0.016 seconds.
051226 173414 Processing pagesByURL: Sorted 125.0 instructions/second
051226 173414 Processing pagesByURL: Merged to new DB containing 2 records
in 0.0 seconds
051226 173414 Processing pagesByURL: Merged Infinity records/second
051226 173414 Processing pagesByMD5: Sorted 2 instructions in 0.016 seconds.
051226 173414 Processing pagesByMD5: Sorted 125.0 instructions/second
051226 173414 Processing pagesByMD5: Merged to new DB containing 2 records
in 0.0 seconds
051226 173414 Processing pagesByMD5: Merged Infinity records/second
051226 173414 Processing linksByMD5: Copied file (0 bytes) in 0.015 secs.
051226 173414 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
051226 173414 FetchListTool started
051226 173414 Processing pagesByURL: Sorted 2 instructions in 0.0 seconds.
051226 173414 Processing pagesByURL: Sorted Infinity instructions/second
051226 173414 Processing pagesByURL: Merged to new DB containing 2 records
in 0.0 seconds
051226 173414 Processing pagesByURL: Merged Infinity records/second
051226 173414 Processing pagesByMD5: Sorted 2 instructions in 0.016 seconds.
051226 173414 Processing pagesByMD5: Sorted 125.0 instructions/second
051226 173414 Processing pagesByMD5: Merged to new DB containing 2 records
in 0.0 seconds
051226 173414 Processing pagesByMD5: Merged Infinity records/second
051226 173414 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051226 173414 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051226 173414 Processing
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414\fetchlist.unsorted:
Sorted 1 entries in 0.015 seconds.
051226 173414 Processing
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414\fetchlist.unsorted:
Sorted 66.66666666666667 entries/second
051226 173414 Overall processing: Sorted 1 entries in 0.015 seconds.
051226 173414 Overall processing: Sorted 0.015 entries/second
051226 173415 FetchListTool completed
051226 173415 found resource parse-plugins.xml at
file:/F:/Module_Index_Management/Atlantis_Tools/parse-plugins.xml
051226 173415 logging at INFO
051226 173415 fetching
file:///F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/
051226 173415 Parsing [
file:///F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/] with [
org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173416 status: segment 20051226173414, 1 pages, 0 errors, 2693 bytes,
1016 ms
051226 173416 status: 0.9842519 pages/s, 20.707739 kb/s, 2693.0 bytes/page
051226 173417 Updating F:\Module_Index_Management\Atlantis_Tools\Crawled\db
051226 173417 Updating for
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414
051226 173417 Processing document 0
051226 173417 Finishing update
051226 173417 Processing pagesByURL: Sorted 30 instructions in 0.016seconds.
051226 173417 Processing pagesByURL: Sorted 1875.0 instructions/second
051226 173417 Processing pagesByURL: Merged to new DB containing 31 records
in 0.016 seconds
051226 173417 Processing pagesByURL: Merged 1937.5 records/second
051226 173417 Processing pagesByMD5: Sorted 31 instructions in 0.0 seconds.
051226 173417 Processing pagesByMD5: Sorted Infinity instructions/second
051226 173417 Processing pagesByMD5: Merged to new DB containing 31 records
in 0.0 seconds
051226 173417 Processing pagesByMD5: Merged Infinity records/second
051226 173417 Processing linksByMD5: Sorted 30 instructions in 0.016seconds.
051226 173417 Processing linksByMD5: Sorted 1875.0 instructions/second
051226 173417 Processing linksByMD5: Merged to new DB containing 29 records
in 0.0 seconds
051226 173417 Processing linksByMD5: Merged Infinity records/second
051226 173417 Processing linksByURL: Sorted 29 instructions in 0.016seconds.
051226 173417 Processing linksByURL: Sorted 1812.5 instructions/second
051226 173417 Processing linksByURL: Merged to new DB containing 29 records
in 0.016 seconds
051226 173417 Processing linksByURL: Merged 1812.5 records/second
051226 173417 Processing linksByMD5: Sorted 29 instructions in 0.015seconds.
051226 173417 Processing linksByMD5: Sorted
1933.3333333333335instructions/second
051226 173417 Processing linksByMD5: Merged to new DB containing 29 records
in 0.0 seconds
051226 173417 Processing linksByMD5: Merged Infinity records/second
051226 173417 Update finished
051226 173417 FetchListTool started
051226 173417 Processing pagesByURL: Sorted 29 instructions in 0.015seconds.
051226 173417 Processing pagesByURL: Sorted
1933.3333333333335instructions/second
051226 173417 Processing pagesByURL: Merged to new DB containing 31 records
in 0.0 seconds
051226 173417 Processing pagesByURL: Merged Infinity records/second
051226 173417 Processing pagesByMD5: Sorted 29 instructions in 0.016seconds.
051226 173417 Processing pagesByMD5: Sorted 1812.5 instructions/second
051226 173417 Processing pagesByMD5: Merged to new DB containing 31 records
in 0.0 seconds
051226 173417 Processing pagesByMD5: Merged Infinity records/second
051226 173417 Processing linksByMD5: Copied file (0 bytes) in 0.015 secs.
051226 173417 Processing linksByURL: Copied file (0 bytes) in 0.0 secs.
051226 173418 Processing
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417\fetchlist.unsorted:
Sorted 29 entries in 0.016 seconds.
051226 173418 Processing
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417\fetchlist.unsorted:
Sorted 1812.5 entries/second
051226 173418 Overall processing: Sorted 29 entries in 0.016 seconds.
051226 173418 Overall processing: Sorted 5.517241379310345E-4 entries/second
051226 173418 FetchListTool completed
051226 173418 logging at INFO
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
tree.html
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
tree.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 ParserFactory: Plugin: parse-msword mapped to contentType
application/msword via parse-plugins.xml, but not enabled via
plugin.includes in nutch-default.xml
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetchListGenTask.html
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/CMP.pdf
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/IndexTask.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetchListGenTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 Unable to successfully parse content
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc of
type application/msword
051226 173418 fetch okay, but can't parse
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc,
reason: notparsed(0,0)
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/IndexTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 ParserFactory:Plugin: parse-text mapped to contentType
application/pdf via parse-plugins.xml, but its plugin.xml file does not
claim to support contentType: application/pdf
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/OptimisationTask.html
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Voltix_4n_network.txt
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/OptimisationTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Voltix_4n_network.txt]
with [org.apache.nutch.parse.text.TextParser@5dcec6]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/MergeTask.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/MergeTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/resources/
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/resources/]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetcherTask.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetcherTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/package-list
051226 173418 fetch of
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/package-list
failed with: java.lang.Exception: java.lang.IllegalArgumentException: null
type
051226 173418 Could not clean the content-type [], Reason is [
org.apache.nutch.util.mime.MimeTypeException: The type can not be null or
empty]. Using its raw version...
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/constant-
values.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/CMP.pdf] with [
org.apache.nutch.parse.pdf.PdfParser@bcda2d]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/deprecated-
list.html
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
frame.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/package-list]
with [org.apache.nutch.parse.text.TextParser@5dcec6]
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/constant-
values.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/deprecated-
list.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
noframe.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
frame.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Scheduler.doc
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCServer.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
noframe.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index-all.html
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCServer.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index-all.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/com/
051226 173418 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/com/] with [
org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173418 Unable to successfully parse content
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Scheduler.doc of
type application/msword
051226 173419 fetch okay, but can't parse
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Scheduler.doc,
reason: notparsed(0,0)
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/J2EETutorial.pdf
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
frame.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
frame.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
summary.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
summary.html] with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/ParseTask.html
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCClient.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/ParseTask.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCClient.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/AtlantisXMLRPCHandler.html
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/help-doc.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/AtlantisXMLRPCHandler.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/help-doc.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173419 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/packages.html
051226 173419 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/packages.html]
with [org.apache.nutch.parse.html.HtmlParser@b9b538]
051226 173420 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/J2EETutorial.pdf]
with [org.apache.nutch.parse.pdf.PdfParser@bcda2d]
051226 173649 fetch of
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/J2EETutorial.pdf
failed with: java.lang.OutOfMemoryError
051226 173649 Could not clean the content-type [], Reason is [
org.apache.nutch.util.mime.MimeTypeException: The type can not be null or
empty]. Using its raw version...
051226 173649 Parsing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/J2EETutorial.pdf]
with [org.apache.nutch.parse.text.TextParser@5dcec6]
051226 173650 status: segment 20051226173417, 28 pages, 2 errors, 17724257
bytes, 152000 ms
051226 173650 status: 0.18421052 pages/s, 910.99176 kb/s, 633009.1bytes/page
051226 173651 Updating F:\Module_Index_Management\Atlantis_Tools\Crawled\db
051226 173651 Updating for
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417
051226 173651 Processing document 0
051226 173651 Finishing update
051226 173651 Processing pagesByURL: Sorted 564 instructions in 0.016seconds.
051226 173651 Processing pagesByURL: Sorted 35250.0 instructions/second
051226 173651 Processing pagesByURL: Merged to new DB containing 153 records
in 0.0 seconds
051226 173651 Processing pagesByURL: Merged Infinity records/second
051226 173651 Processing pagesByMD5: Sorted 178 instructions in 0.0 seconds.
051226 173651 Processing pagesByMD5: Sorted Infinity instructions/second
051226 173651 Processing pagesByMD5: Merged to new DB containing 153 records
in 0.015 seconds
051226 173651 Processing pagesByMD5: Merged 10200.0 records/second
051226 173651 Processing linksByMD5: Sorted 562 instructions in 0.016seconds.
051226 173651 Processing linksByMD5: Sorted 35125.0 instructions/second
051226 173651 Processing linksByMD5: Merged to new DB containing 322 records
in 0.016 seconds
051226 173651 Processing linksByMD5: Merged 20125.0 records/second
051226 173651 Processing linksByURL: Sorted 293 instructions in 0.015seconds.
051226 173651 Processing linksByURL: Sorted
19533.333333333336instructions/second
051226 173652 Processing linksByURL: Merged to new DB containing 322 records
in 0.0 seconds
051226 173652 Processing linksByURL: Merged Infinity records/second
051226 173652 Processing linksByMD5: Sorted 316 instructions in 0.0 seconds.
051226 173652 Processing linksByMD5: Sorted Infinity instructions/second
051226 173652 Processing linksByMD5: Merged to new DB containing 322 records
in 0.016 seconds
051226 173652 Processing linksByMD5: Merged 20125.0 records/second
051226 173652 Update finished
051226 173652 Updating
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments from
F:\Module_Index_Management\Atlantis_Tools\Crawled\db
051226 173652  reading
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414
051226 173652  reading
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417
051226 173652 Sorting pages by url...
051226 173652 Getting updated scores and anchors from db...
051226 173652 Sorting updates by segment...
051226 173652 Updating segments...
051226 173652  updating
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414
051226 173652  updating
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417
051226 173652 Done updating
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments from
F:\Module_Index_Management\Atlantis_Tools\Crawled\db
051226 173652 * Opening 2 segments:
051226 173652 %%&&************************SEGMENT DIR
:F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173414****************&&%%
051226 173652  - segment 20051226173414: 1 records.
051226 173652 %%&&************************SEGMENT DIR
:F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173417****************&&%%
051226 173652  - segment 20051226173417: 29 records.
051226 173652 * TOTAL 30 input records in 2 segments.
051226 173652 * Creating master index...
051226 173652 * Creating index took 219 ms
051226 173652 * Optimizing index took 0 ms
051226 173652 * Removing duplicate entries...
051226 173652 * Deduplicating took 15 ms
051226 173652 * Merging all segments into segments
051226 173653 * Merging took 219 ms
051226 173653 * Creating new segment index(es)...
051226 173653 * Opening segment 20051226173652
051226 173653 * Indexing segment 20051226173652
051226 173653  Indexing [
file:///F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653 found resource common-terms.utf8 at
file:/F:/Module_Index_Management/Atlantis_Tools/common-terms.utf8
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
tree.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Voltix_4n_network.txt]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/OptimisationTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/resources/]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/IndexTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetchListGenTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/MergeTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetcherTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/deprecated-
list.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/constant-
values.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
noframe.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
frame.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/com/] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Scheduler.doc]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCServer.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
frame.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index-all.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/ParseTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
summary.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCClient.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/help-doc.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/AtlantisXMLRPCHandler.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/packages.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/CMP.pdf] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173653 * Optimizing index...
051226 173654 * Moving index to NFS if needed...
051226 173654 DONE indexing segment 20051226173652: total 30 records in 1.0s (
30.0 rec/s).
051226 173654 * Deleting old segments...
051226 173654 Finished SegmentMergeTool: INPUT: 30 -> OUTPUT: 30 entries in
1.75 s (30.0 entries/sec).
051226 173654 indexing segment:
F:\Module_Index_Management\Atlantis_Tools\Crawled\segments\20051226173652
051226 173654 * Opening segment 20051226173652
051226 173654 * Indexing segment 20051226173652
051226 173654  Indexing [
file:///F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
tree.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Voltix_4n_network.txt]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/OptimisationTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/resources/]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/IndexTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetchListGenTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/MergeTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/FetcherTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/deprecated-
list.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/constant-
values.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
noframe.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
frame.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/com/] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Scheduler.doc]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCServer.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/allclasses-
frame.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/index-all.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/ParseTask.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
summary.html] with analyzer
org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/XMLRPCClient.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/help-doc.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/AtlantisXMLRPCHandler.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173654  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/packages.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173655  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/SecurityFacade.html]
with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173655  Indexing
[file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/CMP.pdf] with
analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer@1acfa31 (null)
051226 173655 * Optimizing index...
051226 173655 * Moving index to NFS if needed...
051226 173655 DONE indexing segment 20051226173652: total 30 records in
0.954 s (Infinity rec/s).
051226 173655 done indexing
051226 173655 Reading url hashes...
051226 173655 Sorting url hashes...
051226 173655 Deleting url duplicates...
051226 173655 Deleted 0 url duplicates.
051226 173655 Reading content hashes...
051226 173655 Sorting content hashes...
051226 173655 Deleting content duplicates...
051226 173655 Deleted 0 content duplicates.
051226 173655 Duplicate deletion complete locally.  Now returning to NFS...
051226 173655 DeleteDuplicates complete
051226 173655 Merging segment indexes...
051226 173655 crawl finished:
F:\Module_Index_Management\Atlantis_Tools\Crawled
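
Because the mail client wrapped the log above, the failures are hard to spot by eye. The following is a minimal sketch, not part of Nutch, that rejoins wrapped records and lists the ones reporting a failed fetch or an unparsed document. It assumes every record starts with a "YYMMDD HHMMSS" timestamp as in the log above, and the marker strings are copied from that log. join_wrapped and find_problems are hypothetical helper names.

```python
import re

# A record starts with a "YYMMDD HHMMSS " timestamp, as in the log above.
RECORD_START = re.compile(r"^\d{6} \d{6} ")


def join_wrapped(lines):
    """Rejoin log records that were wrapped across several lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        elif records[-1].endswith("-"):
            # The wrap fell inside a hyphenated name such as overview-tree.html.
            records[-1] += line
        else:
            records[-1] += " " + line
    return records


def find_problems(records):
    """Return the records that report a failed fetch or an unparsed document."""
    markers = ("failed with:", "can't parse", "Unable to successfully parse")
    return [r for r in records if any(m in r for m in markers)]


# A few records copied from the log above, still in their wrapped form.
sample = """\
051226 173418 fetch okay, but can't parse
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/Invoker.doc,
reason: notparsed(0,0)
051226 173418 fetching
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/overview-
tree.html
051226 173649 fetch of
file:/F:/Module_Index_Management/Atlantis_Tools/Crawl_Files/J2EETutorial.pdf
failed with: java.lang.OutOfMemoryError
""".splitlines()

records = join_wrapped(sample)
for record in find_problems(records):
    print(record)
```

Run against the full log, the same two functions should surface the two .doc files that could not be parsed, the package-list fetch failure, and the OutOfMemoryError on J2EETutorial.pdf.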