Posted to user@nutch.apache.org by Arun Kumar Sharma <sh...@yahoo.co.in> on 2005/12/04 08:39:05 UTC

Unable to load parser from parser factory for html and text files.

  I am parsing two files from the local hard disk, and I am getting this error:
   
  java.lang.ExceptionInInitializerError
java.lang.NoClassDefFoundError
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:58)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:252)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:204)
   
  The problem seems to be with the parser. I think my system is unable to load the parser from the parser factory. What configuration is required for that? I have included the error log below.
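
  For context on the error pair itself: an ExceptionInInitializerError followed by a NoClassDefFoundError is what the JVM reports when the static initializer of a class fails. The first use of the class throws ExceptionInInitializerError, and every later use throws NoClassDefFoundError. The sketch below reproduces only that general JVM behaviour. BrokenFactory and its simulated failure are hypothetical stand-ins, not Nutch code, and whether this is what happens at ParseUtil.java:58 is an assumption.

```java
public class StaticInitFailureDemo {

    // Hypothetical stand-in for a class whose static setup fails,
    // for example because a configuration resource cannot be read.
    static class BrokenFactory {
        // Not a compile-time constant, so it is evaluated during class initialization.
        static final Object CONFIG = load();

        private static Object load() {
            throw new IllegalStateException("simulated configuration failure");
        }

        static void use() {
            // Never reached: class initialization fails before the call runs.
        }
    }

    // Tries to use BrokenFactory and returns the simple name of whatever was thrown.
    static String attempt() {
        try {
            BrokenFactory.use();
            return "ok";
        } catch (Throwable t) {
            return t.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // First use triggers the failing static initializer.
        System.out.println("first use:  " + attempt());
        // The class is now marked as unusable, so later uses fail differently.
        System.out.println("second use: " + attempt());
    }
}
```

  If this is the pattern here, the real cause would be the exception wrapped inside the first ExceptionInInitializerError (available through getCause()), which the short stack trace above does not show.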
   
  051204 125954 parsing file:/F:/Atalntis_scheduler/nutch-default.xml
051204 125954 parsing file:/F:/Atalntis_scheduler/crawl-tool.xml
051204 125954 parsing file:/F:/Atalntis_scheduler/nutch-site.xml
051204 125954 No FS indicated, using default:local
051204 125954 crawl started in: /F:/Atalntis_scheduler/Crawled
051204 125954 rootUrlFile = /F:/Atalntis_scheduler/urls.txt
051204 125954 threads = 10
051204 125954 depth = 5
051204 125954 Created webdb at LocalFS,F:\Atalntis_scheduler\Crawled\db
051204 125954 Starting URL processing
051204 125954 Plugins: looking in: F:\Atalntis_scheduler\plugins
051204 125954 Plugin Auto-activation mode: [true]
051204 125954 Registered Plugins:
051204 125954   URL Query Filter (query-url)
051204 125954   Site Query Filter (query-site)
051204 125954   Html Parse Plug-in (parse-html)
051204 125954   the nutch core extension points (nutch-extensionpoints)
051204 125954   Basic Indexing Filter (index-basic)
051204 125954   Pdf Parse Plug-in (parse-pdf)
051204 125954   File Protocol Plug-in (protocol-file)
051204 125954   Text Parse Plug-in (parse-text)
051204 125954   JavaScript Parser (parse-js)
051204 125955   Regex URL Filter (urlfilter-regex)
051204 125955   Basic Query Filter (query-basic)
051204 125955 Registered Extension-Points:
051204 125955   Nutch Protocol (org.apache.nutch.protocol.Protocol)
051204 125955   Nutch URL Filter (org.apache.nutch.net.URLFilter)
051204 125955   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
051204 125955   Nutch Online Search Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
051204 125955   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
051204 125955   Nutch Content Parser (org.apache.nutch.parse.Parser)
051204 125955   Ontology Model Loader (org.apache.nutch.ontology.Ontology)
051204 125955   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
051204 125955   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
051204 125955 found resource crawl-urlfilter.txt at file:/F:/Atalntis_scheduler/crawl-urlfilter.txt
051204 125955 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
051204 125955 Added 2 pages
051204 125955 Processing pagesByURL: Sorted 2 instructions in 0.016 seconds.
051204 125955 Processing pagesByURL: Sorted 125.0 instructions/second
051204 125955 Processing pagesByURL: Merged to new DB containing 2 records in 0.0 seconds
051204 125955 Processing pagesByURL: Merged Infinity records/second
051204 125955 Processing pagesByMD5: Sorted 2 instructions in 0.0 seconds.
051204 125955 Processing pagesByMD5: Sorted Infinity instructions/second
051204 125955 Processing pagesByMD5: Merged to new DB containing 2 records in 0.0 seconds
051204 125955 Processing pagesByMD5: Merged Infinity records/second
051204 125955 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051204 125955 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051204 125955 FetchListTool started
051204 125955 Processing pagesByURL: Sorted 2 instructions in 0.0 seconds.
051204 125955 Processing pagesByURL: Sorted Infinity instructions/second
051204 125955 Processing pagesByURL: Merged to new DB containing 2 records in 0.0 seconds
051204 125955 Processing pagesByURL: Merged Infinity records/second
051204 125955 Processing pagesByMD5: Sorted 2 instructions in 0.016 seconds.
051204 125955 Processing pagesByMD5: Sorted 125.0 instructions/second
051204 125955 Processing pagesByMD5: Merged to new DB containing 2 records in 0.0 seconds
051204 125955 Processing pagesByMD5: Merged Infinity records/second
051204 125955 Processing linksByMD5: Copied file (0 bytes) in 0.0 secs.
051204 125955 Processing linksByURL: Copied file (0 bytes) in 0.016 secs.
051204 125955 Processing F:\Atalntis_scheduler\Crawled\segments\20051204125955\fetchlist.unsorted: Sorted 1 entries in 0.016 seconds.
051204 125955 Processing F:\Atalntis_scheduler\Crawled\segments\20051204125955\fetchlist.unsorted: Sorted 62.5 entries/second
051204 125955 Overall processing: Sorted 1 entries in 0.016 seconds.
051204 125955 Overall processing: Sorted 0.016 entries/second
051204 125955 FetchListTool completed
051204 125955 fetching file:///F:/Atalntis_scheduler/Crawl_Files/Voltix_4n_network.txt
051204 125956 Unable to parse [null].Reason is [java.net.MalformedURLException]
051204 125956 fetch of file:///F:/Atalntis_scheduler/Crawl_Files/Voltix_4n_network.txt failed with: java.lang.ExceptionInInitializerError
java.lang.NoClassDefFoundError
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:58)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:252)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:204)
051204 125956 status: segment 20051204125955, 1 pages, 1 errors, 230 bytes, 1000 ms
051204 125956 status: 1.0 pages/s, 1.796875 kb/s, 230.0 bytes/page
051204 125957 Updating F:\Atalntis_scheduler\Crawled\db
051204 125958 Updating for F:\Atalntis_scheduler\Crawled\segments\20051204125955
051204 125958 Finishing update
051204 125958 Update finished
051204 125958 FetchListTool started
051204 125958 Overall processing: Sorted 0 entries in 0.0 seconds.
051204 125958 Overall processing: Sorted NaN entries/second
051204 125958 FetchListTool completed
051204 125959 Updating F:\Atalntis_scheduler\Crawled\db
051204 125959 Updating for F:\Atalntis_scheduler\Crawled\segments\20051204125958
051204 125959 Finishing update
051204 125959 Update finished
051204 125959 FetchListTool started
051204 125959 Overall processing: Sorted 0 entries in 0.0 seconds.
051204 125959 Overall processing: Sorted NaN entries/second
051204 125959 FetchListTool completed
051204 130000 Updating F:\Atalntis_scheduler\Crawled\db
051204 130000 Updating for F:\Atalntis_scheduler\Crawled\segments\20051204125959
051204 130000 Finishing update
051204 130000 Update finished
051204 130000 FetchListTool started
051204 130000 Overall processing: Sorted 0 entries in 0.0 seconds.
051204 130000 Overall processing: Sorted NaN entries/second
051204 130000 FetchListTool completed
051204 130001 Updating F:\Atalntis_scheduler\Crawled\db
051204 130001 Updating for F:\Atalntis_scheduler\Crawled\segments\20051204130000
051204 130001 Finishing update
051204 130001 Update finished
051204 130001 FetchListTool started
051204 130002 Overall processing: Sorted 0 entries in 0.0 seconds.
051204 130002 Overall processing: Sorted NaN entries/second
051204 130002 FetchListTool completed
051204 130003 Updating F:\Atalntis_scheduler\Crawled\db
051204 130003 Updating for F:\Atalntis_scheduler\Crawled\segments\20051204130001
051204 130003 Finishing update
051204 130003 Update finished
051204 130003 Updating F:\Atalntis_scheduler\Crawled\segments from F:\Atalntis_scheduler\Crawled\db
051204 130003  reading F:\Atalntis_scheduler\Crawled\segments\20051204125955
051204 130003  reading F:\Atalntis_scheduler\Crawled\segments\20051204125958
051204 130003  reading F:\Atalntis_scheduler\Crawled\segments\20051204125959
051204 130003  reading F:\Atalntis_scheduler\Crawled\segments\20051204130000
051204 130003  reading F:\Atalntis_scheduler\Crawled\segments\20051204130001
051204 130003 Sorting pages by url...
051204 130003 Getting updated scores and anchors from db...
051204 130003 Sorting updates by segment...
051204 130003 Updating segments...
051204 130003 Done updating F:\Atalntis_scheduler\Crawled\segments from F:\Atalntis_scheduler\Crawled\db
051204 130003 indexing segment: F:\Atalntis_scheduler\Crawled\segments\20051204125955
051204 130003 * Opening segment 20051204125955
051204 130003 * Indexing segment 20051204125955
051204 130003 * Optimizing index...
051204 130003 * Moving index to NFS if needed...
051204 130003 DONE indexing segment 20051204125955: total 0 records in 0.063 s (NaN rec/s).
051204 130003 done indexing
051204 130003 indexing segment: F:\Atalntis_scheduler\Crawled\segments\20051204125958
051204 130003 * Opening segment 20051204125958
051204 130003 * Indexing segment 20051204125958
051204 130003 * Optimizing index...
051204 130003 * Moving index to NFS if needed...
051204 130003 DONE indexing segment 20051204125958: total 0 records in 0.032 s (NaN rec/s).
051204 130003 done indexing
051204 130003 indexing segment: F:\Atalntis_scheduler\Crawled\segments\20051204125959
051204 130003 * Opening segment 20051204125959
051204 130003 * Indexing segment 20051204125959
051204 130003 * Optimizing index...
051204 130003 * Moving index to NFS if needed...
051204 130003 DONE indexing segment 20051204125959: total 0 records in 0.015 s (NaN rec/s).
051204 130003 done indexing
051204 130003 indexing segment: F:\Atalntis_scheduler\Crawled\segments\20051204130000
051204 130003 * Opening segment 20051204130000
051204 130003 * Indexing segment 20051204130000
051204 130003 * Optimizing index...
051204 130003 * Moving index to NFS if needed...
051204 130003 DONE indexing segment 20051204130000: total 0 records in 0.047 s (NaN rec/s).
051204 130003 done indexing
051204 130003 indexing segment: F:\Atalntis_scheduler\Crawled\segments\20051204130001
051204 130003 * Opening segment 20051204130001
051204 130003 * Indexing segment 20051204130001
051204 130003 * Optimizing index...
051204 130003 * Moving index to NFS if needed...
051204 130003 DONE indexing segment 20051204130001: total 0 records in 0.016 s (NaN rec/s).
051204 130003 done indexing
051204 130003 Reading url hashes...
051204 130003 Sorting url hashes...
051204 130003 Deleting url duplicates...
051204 130003 Deleted 0 url duplicates.
051204 130003 Reading content hashes...
051204 130003 Sorting content hashes...

   
  




Regards,
 
Arun Kumar Sharma (Tech Lead -Java/J2EE)
Mob: +91.981.529.5761




		