Posted to user@nutch.apache.org by AJ Chen <aj...@web2express.org> on 2010/09/29 01:40:12 UTC

hadoop or nutch problem?

I'm doing web crawling with Nutch, which runs on Hadoop in distributed
mode. Since the crawldb reached tens of millions of URLs, I have started to
see strange failures when generating a new segment and updating the crawldb.
For segment generation, the Hadoop select job completes successfully and
generate-temp-1285641291765 is created, but the partition job never starts
and no segment appears in the segments directory. I'm trying to figure out
where it fails. There is no error message except for a few WARN messages
about "connection reset by peer". Hadoop fsck and dfsadmin show the nodes
and directories are healthy. Is this a Hadoop problem or a Nutch problem?
I'd appreciate any suggestions on how to debug this fatal problem.

A similar problem shows up in the updatedb step, which creates the temp dir
but never actually updates the crawldb.
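
For reference, here is a minimal sketch of one way to inspect this through
the plain FileSystem API (the "crawl" and "crawl/segments" paths below are
just placeholders for whatever directory layout is in use), to see what the
generator left behind in HDFS and whether a new segment was actually written:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough HDFS sanity check: did generate/updatedb leave temp dirs behind,
// and did a new segment actually land under segments/?
public class CrawlDirCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder crawl directory; pass the real one as the first argument.
    Path crawl = new Path(args.length > 0 ? args[0] : "crawl");

    // Everything directly under the crawl dir, e.g. generate-temp-* leftovers.
    if (fs.exists(crawl)) {
      for (FileStatus st : fs.listStatus(crawl)) {
        System.out.println(st.getPath().getName());
      }
    }

    // Segments that actually made it into crawl/segments.
    Path segments = new Path(crawl, "segments");
    if (fs.exists(segments)) {
      for (FileStatus st : fs.listStatus(segments)) {
        System.out.println("segment: " + st.getPath().getName());
      }
    } else {
      System.out.println("no segments directory at " + segments);
    }
  }
}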

thanks,
aj
-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
web2express.org
twitter: @web2express
Palo Alto, CA, USA

Re: hadoop or nutch problem?

Posted by AJ Chen <aj...@web2express.org>.
More observations: while the Hadoop jobs are running, this "Filesystem
closed" error shows up consistently:
2010-10-02 05:29:58,951 WARN  mapred.TaskTracker - Error running child
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:226)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:67)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1678)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.io.SequenceFile$Reader.close(SequenceFile.java:1584)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.close(SequenceFileRecordReader.java:125)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:198)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:362)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
2010-10-02 05:29:58,979 WARN  mapred.TaskRunner - Parent died.  Exiting attempt_201009301134_0006_m_000074_1
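
As an aside, one pattern that is known to produce exactly this message (not
necessarily what is happening here) is closing the FileSystem instance
returned by FileSystem.get(), because that instance is cached and shared by
everything in the same task JVM. A minimal sketch of the pitfall, with a
made-up path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// FileSystem.get(conf) returns a cached, JVM-wide instance, so closing it
// in one place breaks every other holder of the same instance, which then
// fails with "java.io.IOException: Filesystem closed".
public class FilesystemClosedPitfall {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    FileSystem fs1 = FileSystem.get(conf);
    FileSystem fs2 = FileSystem.get(conf);   // same cached object as fs1

    fs1.close();                             // closes the shared instance

    // Any later use of fs2 (or of readers the framework still holds open,
    // like the SequenceFileRecordReader in the stack trace above) now fails.
    FSDataInputStream in = fs2.open(new Path("/tmp/example"));  // made-up path
    in.close();
  }
}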

Could this error turn on safemode in Hadoop? I suspect this because the
next Hadoop job is supposed to create a segment directory and write out the
segment results, but it never creates the directory. What else could be
happening to HDFS?
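
(Running "hadoop dfsadmin -safemode get" on the namenode reports the current
safemode state directly. A minimal programmatic check is sketched below; the
class and enum names are from the 0.20-era HDFS client API and are an
assumption for other versions.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

// Asks the namenode whether HDFS is currently in safemode.
// SAFEMODE_GET only queries the state; it does not change it.
public class SafemodeCheck {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      boolean on = dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET);
      System.out.println("safemode: " + (on ? "ON" : "OFF"));
    } else {
      System.out.println("not running against HDFS: " + fs.getUri());
    }
  }
}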

thanks,
-aj

On Tue, Sep 28, 2010 at 4:40 PM, AJ Chen <aj...@web2express.org> wrote:

> I'm doing web crawling with Nutch, which runs on Hadoop in distributed
> mode. Since the crawldb reached tens of millions of URLs, I have started to
> see strange failures when generating a new segment and updating the crawldb.
> For segment generation, the Hadoop select job completes successfully and
> generate-temp-1285641291765 is created, but the partition job never starts
> and no segment appears in the segments directory. I'm trying to figure out
> where it fails. There is no error message except for a few WARN messages
> about "connection reset by peer". Hadoop fsck and dfsadmin show the nodes
> and directories are healthy. Is this a Hadoop problem or a Nutch problem?
> I'd appreciate any suggestions on how to debug this fatal problem.
>
> A similar problem shows up in the updatedb step, which creates the temp dir
> but never actually updates the crawldb.
>
> thanks,
> aj
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> web2express.org
> twitter: @web2express
> Palo Alto, CA, USA
>
