Posted to user@nutch.apache.org by Daniel Varela Santoalla <dv...@ecmwf.int> on 2006/07/28 18:18:05 UTC

"unknown protocol" and some other problems in 0.8.

Hello

This extract from "hadoop.log" shows three problems I'm seeing 
with the freshly downloaded 0.8 version.

- Something goes wrong with the links in some PDF files. This didn't happen 
with 0.7 using the same version of PDFBox.
- I get "Aborting with hung threads" from time to time.
- There are also some NullPointerExceptions here and there.

I'm using Java 1.5 on Linux.
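
The first error seems to come straight from java.net.URL, which rejects any 
scheme it has no handler for. Below is a minimal sketch that reproduces the 
message outside Nutch. The class name UnknownProtocolDemo, the helper 
tryParse and the input "roles:admin" are made up for illustration; the only 
thing known from the log is that the scheme was "roles", so the actual text 
in the PDF is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
class UnknownProtocolDemo {

    // Returns "ok" if java.net.URL accepts the string,
    // otherwise the message of the MalformedURLException.
    static String tryParse(String candidate) {
        try {
            new URL(candidate);
            return "ok";
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed input: some text in the PDF that looks like "scheme:rest".
        System.out.println(tryParse("roles:admin"));
        // A scheme with a built-in handler parses fine.
        System.out.println(tryParse("http://example.org/"));
    }
}
```

If this is what is happening, then text of the form "word:something" in the 
PDF is being picked up as a link and then fails when the URL is built, 
which would explain why it is logged from OutlinkExtractor.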

Regards
Daniel


2006-07-28 16:49:04,799 ERROR parse.OutlinkExtractor - getOutlinks
java.net.MalformedURLException: unknown protocol: roles
         at java.net.URL.<init>(URL.java:574)
         at java.net.URL.<init>(URL.java:464)
         at java.net.URL.<init>(URL.java:413)
         at org.apache.nutch.net.BasicUrlNormalizer.normalize(BasicUrlNormalizer.java:78)
         at org.apache.nutch.parse.Outlink.<init>(Outlink.java:35)
         at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:111)
         at org.apache.nutch.parse.OutlinkExtractor.getOutlinks(OutlinkExtractor.java:70)
         at org.apache.nutch.parse.pdf.PdfParser.getParse(PdfParser.java:150)
         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
         at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:276)
         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:152)
2006-07-28 16:53:12,686 WARN  fetcher.Fetcher - Aborting with 7 hung threads.
2006-07-28 16:53:54,871 FATAL fetcher.Fetcher - java.lang.NullPointerException
2006-07-28 16:53:54,871 FATAL fetcher.Fetcher - at org.apache.hadoop.fs.FSDataInputStream$Buffer.getPos(FSDataInputStream.java:205)
2006-07-28 16:53:54,871 FATAL fetcher.Fetcher - at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:248)
2006-07-28 16:53:54,871 FATAL fetcher.Fetcher - at org.apache.hadoop.io.SequenceFile$Reader.getPosition(SequenceFile.java:462)
2006-07-28 16:53:54,871 FATAL fetcher.Fetcher - at org.apache.hadoop.mapred.SequenceFileRecordReader.getPos(SequenceFileRecordReader.java:68)
2006-07-28 16:53:54,872 FATAL fetcher.Fetcher - at org.apache.hadoop.mapred.MapTask$3.next(MapTask.java:115)
2006-07-28 16:53:54,872 FATAL fetcher.Fetcher - at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:114)
2006-07-28 16:53:54,872 FATAL fetcher.Fetcher - fetcher caught:java.lang.NullPointerException
2006-07-28 16:54:02,805 INFO  fetcher.Fetcher - Fetcher: done
2006-07-28 16:54:02,805 INFO  crawl.CrawlDb - CrawlDb update: starting
2006-07-28 16:54:02,805 INFO  crawl.CrawlDb - CrawlDb update: db: crawl-20060728164608/crawldb

-- 

Daniel Varela Santoalla
European Centre for Medium-Range Weather Forecasts (ECMWF)