Posted to dev@nutch.apache.org by "Armel T. Nene" <ar...@idna-solutions.com> on 2007/02/13 01:11:49 UTC

Nutch 0.8 FATAL fetcher.Fetcher: java.lang.NullPointerException

Hi guys,

 

I have been getting a NullPointerException for the last two days. I am
trying to crawl a very large collection of files (about 40 GB). The crawler
fetches and indexes about 2000 files (including folders) without any parsing
issues. I know there are more files than that in the directory, but the
crawler then fails with the following error:

 

INFO parser.custom: Custom-parse: Parsing content file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf
07/02/12 22:09:16 INFO fetcher.Fetcher: fetch of file:/C:/TeamBinder/AddressBook/9100/(65)E110_ST A0 (1).pdf failed with: java.lang.NullPointerException
07/02/12 22:09:17 INFO mapred.LocalJobRunner: 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
07/02/12 22:09:17 FATAL fetcher.Fetcher: java.lang.NullPointerException
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:198)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:189)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.hadoop.mapred.MapTask$2.collect(MapTask.java:91)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:314)
07/02/12 22:09:17 FATAL fetcher.Fetcher: at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:232)
07/02/12 22:09:17 FATAL fetcher.Fetcher: fetcher caught:java.lang.NullPointerException

The error also occurs with other file formats, not just PDF files. I
understand that this is a known issue, as a similar issue was opened a
while ago:

 

http://issues.apache.org/jira/browse/NUTCH-220

 

At first I thought the error was caused by the parser, but I was able to
fetch, parse and index this file type before, and also earlier in this
crawl. The problem is therefore not caused by any parser or protocol plugin.
I am crawling a local drive, so if there were a problem with the protocol, I
would expect a file-not-found (404) protocol error instead. I am trying to
get to the bottom of this because I am building an index and this error
causes the whole process to abort. If anyone in the community can help, I am
open to any suggestions.

 

It seems that the error is caused by the Hadoop process. If this is the
case, can someone point me in the right direction? Also, some plugins, such
as the parse-xml plugin, have major issues with multi-threading in Nutch.
Has anybody experienced those issues before?
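
The trace above ends in SequenceFile$Writer.append, reached from
Fetcher$FetcherThread.output. One possible reading, and it is only an
assumption, is that a null key or value is handed to the writer for some
files. The sketch below illustrates that failure mode and a defensive guard.
The names Sink, StrictSink and GuardedSink are hypothetical stand-ins made up
for this example; they are not Nutch or Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

public class NullGuardSketch {

    /** Hypothetical key/value output sink; NOT the Hadoop API. */
    public interface Sink {
        void append(Object key, Object value);
    }

    /** Dereferences both arguments, so a null key or value throws an NPE. */
    public static class StrictSink implements Sink {
        public final List<String> records = new ArrayList<String>();

        public void append(Object key, Object value) {
            // Calling getClass() on a null reference throws NullPointerException.
            String keyType = key.getClass().getSimpleName();
            String valueType = value.getClass().getSimpleName();
            records.add(keyType + "=" + valueType);
        }
    }

    /** Skips (and counts) records with a null key or value instead of failing. */
    public static class GuardedSink implements Sink {
        private final Sink delegate;
        public int skipped = 0;

        public GuardedSink(Sink delegate) {
            this.delegate = delegate;
        }

        public void append(Object key, Object value) {
            if (key == null || value == null) {
                skipped++;   // record the problem, keep the job alive
                return;
            }
            delegate.append(key, value);
        }
    }

    public static void main(String[] args) {
        StrictSink strict = new StrictSink();
        try {
            strict.append("file:/some/doc.pdf", null);   // assumed bad record
        } catch (NullPointerException e) {
            System.out.println("strict sink failed on a null value");
        }

        GuardedSink guarded = new GuardedSink(strict);
        guarded.append("file:/some/doc.pdf", null);       // skipped
        guarded.append("file:/some/doc.pdf", "content");  // written
        System.out.println("skipped=" + guarded.skipped
                + " written=" + strict.records.size());
    }
}
```

If this guess is right, logging the URL at the point where a record is
skipped would show which files produce the null, without aborting the whole
fetch.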

 

I am looking forward to your views on this issue. I am using Nutch 0.8.2 dev
from the branch. 

 

Best Regards,

 

Armel

_________________________

Armel T. Nene

iDNA Solutions LTD

Tel: +44 (20) 7257 6124

Mobile: +44 (7886)950 483 

Web: http://www.idna-solutions.com

Blog: http://blog.idna-solutions.com

 

