Posted to user@nutch.apache.org by "Bolle, Jeffrey F." <jb...@mitre.org> on 2007/11/26 21:08:00 UTC

Crash in Parser

All,
I'm having some trouble with the Nutch nightly.  It has been a while
since I last updated my crawl of our intranet.  I was attempting to run
the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at
org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)

In the web interface it says that:
Task task_200711261211_0026_m_000015_0 failed to report status for 602
seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601
seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601
seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602
seconds. Killing!
 
I don't have the fetchers set to parse.  Nutch and hadoop are running
on a 3 node cluster.  I've attached the job configuration file as saved
from the web interface.
 
Is there any way I can get more information on which file or url the
parse is failing on?  Why doesn't the parsing of a file or URL fail
more cleanly?
 
Any recommendations on helping nutch avoid whatever is causing the hang
and allowing it to index the rest of the content?
 
Thanks.
 
 
Jeff Bolle
 

Newbie question: fetching specific files only.

Posted by "Jose C. Lacal" <Jo...@OpenPHI.com>.
Dear all:

First of all, I am impressed with Nutch's capabilities. In less than 24
hours of work I have a nice system up and running, doing what I thought
would have taken me months to build. Congrats to the community members.

I have RTFM, the tutorials, and the lists. This may be a regex question
more than a Nutch issue. Yet here's the newbie question:

a.) I need to crawl a particular website where the files of interest are
all named as follows: PPPxxxxxxxx ('PPP' followed by 8 digits)

b.) The files are stored under
./show/PPPxxxxxxxx
./show/record/PPPxxxxxxxx
./show/locn/PPPxxxxxxxx
./show/related/PPPxxxxxxxx


After RTFM, I have tried the following with no success:

* regex-urlfilter.txt (+^http://*.*/show/)
* URLs file (http://*.*/show/)

Any pointers appreciated. Thanks.
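
P.S. To make the goal concrete, here is roughly what I am trying to
express (example.com stands in for the real host, which I can't share
here). My understanding is that regex-urlfilter.txt takes Java regular
expressions rather than shell-style wildcards, which is probably where I
went wrong:

# accept only the /show/ pages named PPP followed by 8 digits
+^http://www\.example\.com/show/(record/|locn/|related/)?PPP[0-9]{8}$
+^http://www\.example\.com/show/?$
# reject everything else
-.

I also assume the URLs seed file needs a concrete starting URL such as
http://www.example.com/show/ rather than a wildcard pattern.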


-- 

José C. Lacal, Founder & Chief Vision Officer
Open Personalized Health Informatics _OpenPHI
15625 NW  15th Avenue; Suite 15
Miami, FL 33169-5601  USA     www.OpenPHI.com
+1 (954) 553-1984      Jose.Lacal@OpenPHI.com          


Re: Crash in Parser

Posted by Karol Rybak <ka...@gmail.com>.
Hello, I had the same issue. It seems there is a problem with the NekoHTML
parser; I solved it by switching to the TagSoup parser.
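
For example, something like this in conf/nutch-site.xml should do it, if I
remember the property correctly (it appears as parser.html.impl in the job
configuration you posted, currently set to neko):

<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
  <description>Use TagSoup instead of NekoHTML in the parse-html plugin.</description>
</property>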

-- 
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Information Technology
and Management
+48(17)8661277

RE: Crash in Parser

Posted by "Bolle, Jeffrey F." <jb...@mitre.org>.
Ned,
Thanks for the hint; I found the advice about using kill -s SIGQUIT in an
earlier post.  Luckily, I just saw the hung thread on the machine and
managed to get the command in before Nutch killed it.

It doesn't appear that I am stuck in the regexp.  I ran the command a
few times; here are the last two iterations:

2007-11-27 17:46:29
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10
tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
	at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@1b15692" daemon prio=10
tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
	at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon
prio=10 tid=0x52203800 nid=0x33cc in Object.wait()
[0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x572c5fe8> (a
org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
	- locked <0x572c5fe8> (a
org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10
tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x572ca5e0> (a
org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
	- locked <0x572ca5e0> (a
org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on
condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on
condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait()
[0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
	- locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
	at
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in
Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
	at java.lang.Object.wait(Object.java:485)
	at
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
	- locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable
[0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
	at java.util.Arrays.copyOf(Arrays.java:2882)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
	at java.lang.StringBuffer.append(StringBuffer.java:224)
	- locked <0xab27c430> (a java.lang.StringBuffer)
	at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
	at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
	at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
	at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
	at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
	at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0
runnable 

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on
condition 

JNI global references: 1196

Heap
 PSYoungGen      total 159424K, used 155715K [0xaa7a0000, 0xb4e40000,
0xb4e40000)
  eden space 148224K, 97% used [0xaa7a0000,0xb34ccfa0,0xb3860000)
  from space 11200K, 99% used [0xb3860000,0xb4343f40,0xb4350000)
  to   space 11200K, 0% used [0xb4350000,0xb4350000,0xb4e40000)
 PSOldGen        total 369088K, used 120964K [0x57240000, 0x6dab0000,
0xaa7a0000)
  object space 369088K, 32% used [0x57240000,0x5e861268,0x6dab0000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000,
0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)

2007-11-27 17:47:25
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10
tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
	at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@1b15692" daemon prio=10
tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
	at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon
prio=10 tid=0x52203800 nid=0x33cc in Object.wait()
[0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x572c5fe8> (a
org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
	- locked <0x572c5fe8> (a
org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10
tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x572ca5e0> (a
org.apache.hadoop.ipc.Client$Connection)
	at java.lang.Object.wait(Object.java:485)
	at
org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
	- locked <0x572ca5e0> (a
org.apache.hadoop.ipc.Client$Connection)
	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10
tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable
[0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on
condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on
condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 waiting on
condition [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait()
[0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
	- locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
	at
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in
Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
	at java.lang.Object.wait(Object.java:485)
	at
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
	- locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable
[0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
	at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
	at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
	at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
	at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
	at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
	at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
	at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
	at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
	at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
	at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable 

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf
runnable 

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0
runnable 

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on
condition 

JNI global references: 1196

Heap
 PSYoungGen      total 159104K, used 137234K [0xaa7a0000, 0xb4e40000,
0xb4e40000)
  eden space 147584K, 85% used [0xaa7a0000,0xb2272700,0xb37c0000)
  from space 11520K, 99% used [0xb4300000,0xb4e32480,0xb4e40000)
  to   space 11520K, 0% used [0xb37c0000,0xb37c0000,0xb4300000)
 PSOldGen        total 412672K, used 132811K [0x57240000, 0x70540000,
0xaa7a0000)
  object space 412672K, 32% used [0x57240000,0x5f3f2f78,0x70540000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000,
0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000) 


For a long time it sat in java.util.Arrays.copyOf, but it does appear to
have eventually returned from that.  I think my problem may lie more in
making sure the task JVMs have enough memory and enough time to parse the
larger documents (around 10MB).  Even so, it is frustrating that the
failure to parse one document kills the whole parse job.  Is there a way
to make this more granular at the document level, so that whatever has
already been parsed is returned before the job hangs, times out, or
throws an exception?
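
In the meantime I am thinking of raising the task timeout and the child
heap in conf/hadoop-site.xml, along these lines (the current values show
up in the job configuration as mapred.task.timeout 600000 and
mapred.child.java.opts -Xmx1500m; the new numbers below are just guesses):

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes instead of 10 -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2000m</value>
</property>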

Thanks.

Jeff


-----Original Message-----
From: Ned Rockson [mailto:ned@discoveryengine.com] 
Sent: Tuesday, November 27, 2007 2:25 PM
To: nutch-user@lucene.apache.org
Subject: Re: Crash in Parser

This is a problem with Regex parsing.  It has happened for me in
urlnormalizer where the URL was parsed incorrectly and for some reason
is extremely long or contains control characters.  What happens is that
if the URL is really long (say thousands of characters) it goes into a
very inefficient algorithm (I believe O(n^3) but not sure) to find
certain features.  I fixed this by having prefix-urlnormalizer check
first to see if the length of the URL is less than some constant (I have
defined as 1024).  I also saw this problem happen the other day with the
.js parser.  Essentially there was a page: http://www.magic-cadeaux.fr/
that had a javascript line that was 150000 slashes in a row.  It parses
fine in a browser, but again this led to an endless regex loop.

If you find these are problems you can find the stuck task and do a kill
-SIGQUIT which will dump stack traces to stdout (redirected to
logs/userlogs/[task name]/stdout) and check to see if it's stuck in a
regex loop and what put it there.

--Ned

Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv.  Here it is in the body of the e-mail.
>  
> Jeff
>  
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name	 value	
> dfs.secondary.info.bindAddress	 0.0.0.0	
> dfs.datanode.port	 50010	
> dfs.client.buffer.dir	 ${hadoop.tmp.dir}/dfs/tmp	
> searcher.summary.length	 20	
> generate.update.crawldb	 false	
> lang.ngram.max.length	 4	
> tasktracker.http.port	 50060	
> searcher.filter.cache.size	 16	
> ftp.timeout	 60000	
> hadoop.tmp.dir	 /tmp/hadoop-${user.name}	
> hadoop.native.lib	 true	
> map.sort.class	 org.apache.hadoop.mapred.MergeSorter	
> ftp.follow.talk	 false	
> indexer.mergeFactor	 50	
> ipc.client.idlethreshold	 4000	
> query.host.boost	 2.0	
> mapred.system.dir	 /nutch/filesystem/mapreduce/system	
> ftp.password	 anonymous@example.com	
> http.agent.version	 Nutch-1.0-dev	
> query.tag.boost	 1.0	
> dfs.namenode.logging.level	 info	
> db.fetch.schedule.adaptive.sync_delta_rate	 0.3	
> io.skip.checksum.errors	 false	
> urlfilter.automaton.file	 automaton-urlfilter.txt	
> fs.default.name	 cisserver:9000	
> db.ignore.external.links	 false	
> extension.ontology.urls	 	
> dfs.safemode.threshold.pct	 0.999f	
> dfs.namenode.handler.count	 10	
> plugin.folders	 plugins	
> mapred.tasktracker.dns.nameserver	 default	
> io.sort.factor	 10	
> fetcher.threads.per.host.by.ip	 false	
> parser.html.impl	 neko	
> mapred.task.timeout	 600000	
> mapred.max.tracker.failures	 4	
> hadoop.rpc.socket.factory.class.default
> org.apache.hadoop.net.StandardSocketFactory	
> db.update.additions.allowed	 true	
> fs.hdfs.impl	 org.apache.hadoop.dfs.DistributedFileSystem	
> indexer.score.power	 0.5	
> ipc.client.maxidletime	 120000	
> db.fetch.schedule.class
org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class	 org.apache.hadoop.io.Text	
> file.content.limit	 10485760	
> http.agent.url	 http://poisk/index.php/Category:Systems

> dfs.safemode.extension	 30000	
> tasktracker.http.threads	 40	
> db.fetch.schedule.adaptive.dec_rate	 0.2	
> user.name	 nutch	
> mapred.output.compress	 false	
> io.bytes.per.checksum	 512	
> fetcher.server.delay	 0.2	
> searcher.summary.context	 5	
> db.fetch.interval.default	 2592000	
> searcher.max.time.tick_count	 -1	
> parser.html.form.use_action	 false	
> fs.trash.root	 ${hadoop.tmp.dir}/Trash	
> mapred.reduce.max.attempts	 4	
> fs.ramfs.impl	 org.apache.hadoop.fs.InMemoryFileSystem	
> db.score.count.filtered	 false	
> fetcher.max.crawl.delay	 30	
> dfs.info.port	 50070	
> indexer.maxMergeDocs	 2147483647	
> mapred.jar
>
/nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir	 ${hadoop.tmp.dir}/s3	
> dfs.block.size	 67108864	
> http.robots.403.allow	 true	
> ftp.content.limit	 10485760	
> job.end.retry.attempts	 0	
> fs.file.impl	 org.apache.hadoop.fs.LocalFileSystem	
> query.title.boost	 1.5	
> mapred.speculative.execution	 true	
> mapred.local.dir.minspacestart	 0	
> mapred.output.compression.type	 RECORD	
> mime.types.file	 tika-mimetypes.xml	
> generate.max.per.host.by.ip	 false	
> fetcher.parse	 false	
> db.default.fetch.interval	 30	
> db.max.outlinks.per.page	 -1	
> analysis.common.terms.file	 common-terms.utf8	
> mapred.userlog.retain.hours	 24	
> dfs.replication.max	 512	
> http.redirect.max	 5	
> local.cache.size	 10737418240	
> mapred.min.split.size	 0	
> mapred.map.tasks	 18	
> fetcher.threads.fetch	 10	
> mapred.child.java.opts	 -Xmx1500m	
> mapred.output.value.class	 org.apache.nutch.parse.ParseImpl
>
> http.timeout	 10000	
> http.content.limit	 10485760	
> dfs.secondary.info.port	 50090	
> ipc.server.listen.queue.size	 128	
> encodingdetector.charset.min.confidence	 -1	
> mapred.inmem.merge.threshold	 1000	
> job.end.retry.interval	 30000	
> fs.checkpoint.dir	 ${hadoop.tmp.dir}/dfs/namesecondary	
> query.url.boost	 4.0	
> mapred.reduce.tasks	 6	
> db.score.link.external	 1.0	
> query.anchor.boost	 2.0	
> mapred.userlog.limit.kb	 0	
> webinterface.private.actions	 false	
> db.max.inlinks	 10000000	
> mapred.job.split.file
> /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name	 parse crawl20071126/segments/20071126123442

> dfs.datanode.dns.nameserver	 default	
> dfs.blockreport.intervalMsec	 3600000	
> ftp.username	 anonymous	
> db.fetch.schedule.adaptive.inc_rate	 0.4	
> searcher.max.hits	 -1	
> mapred.map.max.attempts	 4	
> urlnormalizer.regex.file	 regex-normalize.xml	
> ftp.keep.connection	 false	
> searcher.filter.cache.threshold	 0.05	
> mapred.job.tracker.handler.count	 10	
> dfs.client.block.write.retries	 3	
> mapred.input.format.class
> org.apache.hadoop.mapred.SequenceFileInputFormat	
> http.verbose	 true	
> fetcher.threads.per.host	 8	
> mapred.tasktracker.expiry.interval	 600000	
> mapred.job.tracker.info.bindAddress	 0.0.0.0	
> ipc.client.timeout	 60000	
> keep.failed.task.files	 false	
> mapred.output.format.class
> org.apache.nutch.parse.ParseOutputFormat	
> mapred.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec	
> io.map.index.skip	 0	
> mapred.working.dir	 /user/nutch	
> tasktracker.http.bindAddress	 0.0.0.0	
> io.seqfile.compression.type	 RECORD	
> mapred.reducer.class	 org.apache.nutch.parse.ParseSegment	
> lang.analyze.max.length	 2048	
> db.fetch.schedule.adaptive.min_interval	 60.0	
> http.agent.name	 Jeffcrawler	
> dfs.default.chunk.view.size	 32768	
> hadoop.logfile.size	 10000000	
> dfs.datanode.du.pct	 0.98f	
> parser.caching.forbidden.policy	 content	
> http.useHttp11	 false	
> fs.inmemory.size.mb	 75	
> db.fetch.schedule.adaptive.sync_delta	 true	
> dfs.datanode.du.reserved	 0	
> mapred.job.tracker.info.port	 50030	
> plugin.auto-activation	 true	
> fs.checkpoint.period	 3600	
> mapred.jobtracker.completeuserjobs.maximum	 100	
> mapred.task.tracker.report.bindAddress	 127.0.0.1	
> db.signature.text_profile.min_token_len	 2	
> query.phrase.boost	 1.0	
> lang.ngram.min.length	 1	
> dfs.df.interval	 60000	
> dfs.data.dir	 /nutch/filesystem/data	
> dfs.datanode.bindAddress	 0.0.0.0	
> fs.s3.maxRetries	 4	
> dfs.datanode.dns.interface	 default	
> http.agent.email	 Jeff	
> extension.clustering.hits-to-cluster	 100	
> searcher.max.time.tick_length	 200	
> http.agent.description	 Jeff's Crawler	
> query.lang.boost	 0.0	
> mapred.local.dir	 /nutch/filesystem/mapreduce/local	
> fs.hftp.impl	 org.apache.hadoop.dfs.HftpFileSystem	
> mapred.mapper.class	 org.apache.nutch.parse.ParseSegment	
> fs.trash.interval	 0	
> fs.s3.sleepTimeSeconds	 10	
> dfs.replication.min	 1	
> mapred.submit.replication	 10	
> indexer.max.title.length	 100	
> parser.character.encoding.default	 windows-1252	
> mapred.map.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec	
> mapred.tasktracker.dns.interface	 default	
> http.robots.agents	 Jeffcrawler,*	
> mapred.job.tracker	 cisserver:9001	
> dfs.heartbeat.interval	 3	
> urlfilter.regex.file	 crawl-urlfilter.txt	
> io.seqfile.sorter.recordlimit	 1000000	
> fetcher.store.content	 true	
> urlfilter.suffix.file	 suffix-urlfilter.txt	
> dfs.name.dir	 /nutch/filesystem/name	
> fetcher.verbose	 true	
> db.signature.class	 org.apache.nutch.crawl.MD5Signature	
> db.max.anchor.length	 100	
> parse.plugin.file	 parse-plugins.xml	
> nutch.segment.name	 20071126123442	
> mapred.local.dir.minspacekill	 0	
> searcher.dir	 /var/nutch/crawl	
> fs.kfs.impl	 org.apache.hadoop.fs.kfs.KosmosFileSystem	
> plugin.includes
>
protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfi
>
lter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|s
>
wf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|
>
query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-o
> pic|subcollection	
> mapred.map.output.compression.type	 RECORD	
> mapred.temp.dir	 ${hadoop.tmp.dir}/mapred/temp	
> db.fetch.retry.max	 3	
> query.cc.boost	 0.0	
> dfs.replication	 2	
> db.ignore.internal.links	 false	
> dfs.info.bindAddress	 0.0.0.0	
> query.site.boost	 0.0	
> searcher.hostgrouping.rawhits.factor	 2.0	
> fetcher.server.min.delay	 0.0	
> hadoop.logfile.count	 10	
> indexer.termIndexInterval	 128	
> file.content.ignored	 true	
> db.score.link.internal	 1.0	
> io.seqfile.compress.blocksize	 1000000	
> fs.s3.block.size	 67108864	
> ftp.server.timeout	 100000	
> http.max.delays	 1000	
> indexer.minMergeDocs	 50	
> mapred.reduce.parallel.copies	 5	
> io.seqfile.lazydecompress	 true	
> mapred.output.dir
> /user/nutch/crawl20071126/segments/20071126123442	
> indexer.max.tokens	 10000000	
> io.sort.mb	 100	
> ipc.client.connection.maxidletime	 1000	
> db.fetch.schedule.adaptive.max_interval	 31536000.0	
> mapred.compress.map.output	 false	
> ipc.client.kill.max	 10	
> urlnormalizer.order
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer	
> ipc.client.connect.max.retries	 10	
> urlfilter.prefix.file	 prefix-urlfilter.txt	
> db.signature.text_profile.quant_rate	 0.01	
> query.type.boost	 0.0	
> fs.s3.impl	 org.apache.hadoop.fs.s3.S3FileSystem	
> mime.type.magic	 true	
> generate.max.per.host	 -1	
> db.fetch.interval.max	 7776000	
> urlnormalizer.loop.count	 1	
> mapred.input.dir
> /user/nutch/crawl20071126/segments/20071126123442/content	
> io.file.buffer.size	 4096	
> db.score.injected	 1.0	
> dfs.replication.considerLoad	 true	
> jobclient.output.filter	 FAILED	
> mapred.tasktracker.tasks.maximum	 2	
> io.compression.codecs
>
org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compres
> s.GzipCodec	
> fs.checkpoint.size	 67108864	
>
>
> ________________________________
>
> 	From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org] 
> 	Sent: Monday, November 26, 2007 3:08 PM
> 	To: nutch-user@lucene.apache.org
> 	Subject: Crash in Parser
> 	
> 	
> 	All,
> 	I'm having some trouble with the Nutch nightly.  It has been a
> while since I last updated my crawl of our intranet.  I was
attempting
> to run the crawl today and it failed with this:
> 	Exception in thread "main" java.io.IOException: Job failed!
> 	        at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
> 	        at
> org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
> 	        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
> 	
> 	In the web interface it says that:
> 	Task task_200711261211_0026_m_000015_0 failed to report status
> for 602 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_1 failed to report status
> for 601 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_2 failed to report status
> for 601 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_3 failed to report status
> for 602 seconds. Killing!
> 	 
> 	I don't have the fetchers set to parse.  Nutch and hadoop are
> running on a 3 node cluster.  I've attached the job configuration
file
> as saved from the web interface.
> 	 
> 	Is there any way I can get more information on which file or
> url the parse is failing on?  Why doesn't the parsing of a file or
URL
> fail more cleanly?
> 	 
> 	Any recommendations on helping nutch avoid whatever is causing
> the hang and allowing it to index the rest of the content?
> 	 
> 	Thanks.
> 	 
> 	 
> 	Jeff Bolle
> 	 
>
>
>   


Re: Crash in Parser

Posted by Ned Rockson <ne...@discoveryengine.com>.
This is a problem with Regex parsing.  It has happened for me in 
urlnormalizer where the URL was parsed incorrectly and for some reason 
is extremely long or contains control characters.  What happens is that 
if the URL is really long (say thousands of characters) it goes into a 
very inefficient algorithm (I believe O(n^3) but not sure) to find 
certain features.  I fixed this by having prefix-urlnormalizer check 
first to see if the length of the URL is less than some constant (I have 
defined as 1024).  I also saw this problem happen the other day with the 
.js parser.  Essentially there was a page: http://www.magic-cadeaux.fr/ 
that had a javascript line that was 150000 slashes in a row.  It parses 
fine in a browser, but again this led to an endless regex loop.
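
Roughly, the guard I added looks like the following (paraphrased from
memory, so treat it as a sketch rather than the exact plugin code; the
method signature is the URL normalizer interface as I recall it, and
doRegexNormalize stands in for the existing normalization logic):

// skip absurdly long URLs instead of feeding them to the regex machinery
private static final int MAX_URL_LENGTH = 1024;

public String normalize(String urlString, String scope)
    throws MalformedURLException {
  if (urlString == null || urlString.length() > MAX_URL_LENGTH) {
    // assumes the caller treats null as "discard this URL"
    return null;
  }
  return doRegexNormalize(urlString, scope);
}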

If you find these are problems you can find the stuck task and do a kill 
-SIGQUIT which will dump stack traces to stdout (redirected to 
logs/userlogs/[task name]/stdout) and check to see if it's stuck in a 
regex loop and what put it there.
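
Concretely, on the node running the stuck map task, something like this
(the child task JVMs show up with TaskTracker$Child on their command line;
the pid you want is whichever one is hung):

ps ax | grep 'TaskTracker\$Child'
kill -s SIGQUIT <pid-of-stuck-task>
# the thread dump then shows up in logs/userlogs/[task name]/stdout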

--Ned

Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv.  Here it is in the body of the e-mail.
>  
> Jeff
>  
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name	 value	
> dfs.secondary.info.bindAddress	 0.0.0.0	
> dfs.datanode.port	 50010	
> dfs.client.buffer.dir	 ${hadoop.tmp.dir}/dfs/tmp	
> searcher.summary.length	 20	
> generate.update.crawldb	 false	
> lang.ngram.max.length	 4	
> tasktracker.http.port	 50060	
> searcher.filter.cache.size	 16	
> ftp.timeout	 60000	
> hadoop.tmp.dir	 /tmp/hadoop-${user.name}	
> hadoop.native.lib	 true	
> map.sort.class	 org.apache.hadoop.mapred.MergeSorter	
> ftp.follow.talk	 false	
> indexer.mergeFactor	 50	
> ipc.client.idlethreshold	 4000	
> query.host.boost	 2.0	
> mapred.system.dir	 /nutch/filesystem/mapreduce/system	
> ftp.password	 anonymous@example.com	
> http.agent.version	 Nutch-1.0-dev	
> query.tag.boost	 1.0	
> dfs.namenode.logging.level	 info	
> db.fetch.schedule.adaptive.sync_delta_rate	 0.3	
> io.skip.checksum.errors	 false	
> urlfilter.automaton.file	 automaton-urlfilter.txt	
> fs.default.name	 cisserver:9000	
> db.ignore.external.links	 false	
> extension.ontology.urls	 	
> dfs.safemode.threshold.pct	 0.999f	
> dfs.namenode.handler.count	 10	
> plugin.folders	 plugins	
> mapred.tasktracker.dns.nameserver	 default	
> io.sort.factor	 10	
> fetcher.threads.per.host.by.ip	 false	
> parser.html.impl	 neko	
> mapred.task.timeout	 600000	
> mapred.max.tracker.failures	 4	
> hadoop.rpc.socket.factory.class.default
> org.apache.hadoop.net.StandardSocketFactory	
> db.update.additions.allowed	 true	
> fs.hdfs.impl	 org.apache.hadoop.dfs.DistributedFileSystem	
> indexer.score.power	 0.5	
> ipc.client.maxidletime	 120000	
> db.fetch.schedule.class	 org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class	 org.apache.hadoop.io.Text	
> file.content.limit	 10485760	
> http.agent.url	 http://poisk/index.php/Category:Systems	
> dfs.safemode.extension	 30000	
> tasktracker.http.threads	 40	
> db.fetch.schedule.adaptive.dec_rate	 0.2	
> user.name	 nutch	
> mapred.output.compress	 false	
> io.bytes.per.checksum	 512	
> fetcher.server.delay	 0.2	
> searcher.summary.context	 5	
> db.fetch.interval.default	 2592000	
> searcher.max.time.tick_count	 -1	
> parser.html.form.use_action	 false	
> fs.trash.root	 ${hadoop.tmp.dir}/Trash	
> mapred.reduce.max.attempts	 4	
> fs.ramfs.impl	 org.apache.hadoop.fs.InMemoryFileSystem	
> db.score.count.filtered	 false	
> fetcher.max.crawl.delay	 30	
> dfs.info.port	 50070	
> indexer.maxMergeDocs	 2147483647	
> mapred.jar
> /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir	 ${hadoop.tmp.dir}/s3	
> dfs.block.size	 67108864	
> http.robots.403.allow	 true	
> ftp.content.limit	 10485760	
> job.end.retry.attempts	 0	
> fs.file.impl	 org.apache.hadoop.fs.LocalFileSystem	
> query.title.boost	 1.5	
> mapred.speculative.execution	 true	
> mapred.local.dir.minspacestart	 0	
> mapred.output.compression.type	 RECORD	
> mime.types.file	 tika-mimetypes.xml	
> generate.max.per.host.by.ip	 false	
> fetcher.parse	 false	
> db.default.fetch.interval	 30	
> db.max.outlinks.per.page	 -1	
> analysis.common.terms.file	 common-terms.utf8	
> mapred.userlog.retain.hours	 24	
> dfs.replication.max	 512	
> http.redirect.max	 5	
> local.cache.size	 10737418240	
> mapred.min.split.size	 0	
> mapred.map.tasks	 18	
> fetcher.threads.fetch	 10	
> mapred.child.java.opts	 -Xmx1500m	
> mapred.output.value.class	 org.apache.nutch.parse.ParseImpl
>
> http.timeout	 10000	
> http.content.limit	 10485760	
> dfs.secondary.info.port	 50090	
> ipc.server.listen.queue.size	 128	
> encodingdetector.charset.min.confidence	 -1	
> mapred.inmem.merge.threshold	 1000	
> job.end.retry.interval	 30000	
> fs.checkpoint.dir	 ${hadoop.tmp.dir}/dfs/namesecondary	
> query.url.boost	 4.0	
> mapred.reduce.tasks	 6	
> db.score.link.external	 1.0	
> query.anchor.boost	 2.0	
> mapred.userlog.limit.kb	 0	
> webinterface.private.actions	 false	
> db.max.inlinks	 10000000	
> mapred.job.split.file
> /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name	 parse crawl20071126/segments/20071126123442	
> dfs.datanode.dns.nameserver	 default	
> dfs.blockreport.intervalMsec	 3600000	
> ftp.username	 anonymous	
> db.fetch.schedule.adaptive.inc_rate	 0.4	
> searcher.max.hits	 -1	
> mapred.map.max.attempts	 4	
> urlnormalizer.regex.file	 regex-normalize.xml	
> ftp.keep.connection	 false	
> searcher.filter.cache.threshold	 0.05	
> mapred.job.tracker.handler.count	 10	
> dfs.client.block.write.retries	 3	
> mapred.input.format.class
> org.apache.hadoop.mapred.SequenceFileInputFormat	
> http.verbose	 true	
> fetcher.threads.per.host	 8	
> mapred.tasktracker.expiry.interval	 600000	
> mapred.job.tracker.info.bindAddress	 0.0.0.0	
> ipc.client.timeout	 60000	
> keep.failed.task.files	 false	
> mapred.output.format.class
> org.apache.nutch.parse.ParseOutputFormat	
> mapred.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec	
> io.map.index.skip	 0	
> mapred.working.dir	 /user/nutch	
> tasktracker.http.bindAddress	 0.0.0.0	
> io.seqfile.compression.type	 RECORD	
> mapred.reducer.class	 org.apache.nutch.parse.ParseSegment	
> lang.analyze.max.length	 2048	
> db.fetch.schedule.adaptive.min_interval	 60.0	
> http.agent.name	 Jeffcrawler	
> dfs.default.chunk.view.size	 32768	
> hadoop.logfile.size	 10000000	
> dfs.datanode.du.pct	 0.98f	
> parser.caching.forbidden.policy	 content	
> http.useHttp11	 false	
> fs.inmemory.size.mb	 75	
> db.fetch.schedule.adaptive.sync_delta	 true	
> dfs.datanode.du.reserved	 0	
> mapred.job.tracker.info.port	 50030	
> plugin.auto-activation	 true	
> fs.checkpoint.period	 3600	
> mapred.jobtracker.completeuserjobs.maximum	 100	
> mapred.task.tracker.report.bindAddress	 127.0.0.1	
> db.signature.text_profile.min_token_len	 2	
> query.phrase.boost	 1.0	
> lang.ngram.min.length	 1	
> dfs.df.interval	 60000	
> dfs.data.dir	 /nutch/filesystem/data	
> dfs.datanode.bindAddress	 0.0.0.0	
> fs.s3.maxRetries	 4	
> dfs.datanode.dns.interface	 default	
> http.agent.email	 Jeff	
> extension.clustering.hits-to-cluster	 100	
> searcher.max.time.tick_length	 200	
> http.agent.description	 Jeff's Crawler	
> query.lang.boost	 0.0	
> mapred.local.dir	 /nutch/filesystem/mapreduce/local	
> fs.hftp.impl	 org.apache.hadoop.dfs.HftpFileSystem	
> mapred.mapper.class	 org.apache.nutch.parse.ParseSegment	
> fs.trash.interval	 0	
> fs.s3.sleepTimeSeconds	 10	
> dfs.replication.min	 1	
> mapred.submit.replication	 10	
> indexer.max.title.length	 100	
> parser.character.encoding.default	 windows-1252	
> mapred.map.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec	
> mapred.tasktracker.dns.interface	 default	
> http.robots.agents	 Jeffcrawler,*	
> mapred.job.tracker	 cisserver:9001	
> dfs.heartbeat.interval	 3	
> urlfilter.regex.file	 crawl-urlfilter.txt	
> io.seqfile.sorter.recordlimit	 1000000	
> fetcher.store.content	 true	
> urlfilter.suffix.file	 suffix-urlfilter.txt	
> dfs.name.dir	 /nutch/filesystem/name	
> fetcher.verbose	 true	
> db.signature.class	 org.apache.nutch.crawl.MD5Signature	
> db.max.anchor.length	 100	
> parse.plugin.file	 parse-plugins.xml	
> nutch.segment.name	 20071126123442	
> mapred.local.dir.minspacekill	 0	
> searcher.dir	 /var/nutch/crawl	
> fs.kfs.impl	 org.apache.hadoop.fs.kfs.KosmosFileSystem	
> plugin.includes
> protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfi
> lter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|s
> wf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|
> query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-o
> pic|subcollection	
> mapred.map.output.compression.type	 RECORD	
> mapred.temp.dir	 ${hadoop.tmp.dir}/mapred/temp	
> db.fetch.retry.max	 3	
> query.cc.boost	 0.0	
> dfs.replication	 2	
> db.ignore.internal.links	 false	
> dfs.info.bindAddress	 0.0.0.0	
> query.site.boost	 0.0	
> searcher.hostgrouping.rawhits.factor	 2.0	
> fetcher.server.min.delay	 0.0	
> hadoop.logfile.count	 10	
> indexer.termIndexInterval	 128	
> file.content.ignored	 true	
> db.score.link.internal	 1.0	
> io.seqfile.compress.blocksize	 1000000	
> fs.s3.block.size	 67108864	
> ftp.server.timeout	 100000	
> http.max.delays	 1000	
> indexer.minMergeDocs	 50	
> mapred.reduce.parallel.copies	 5	
> io.seqfile.lazydecompress	 true	
> mapred.output.dir
> /user/nutch/crawl20071126/segments/20071126123442	
> indexer.max.tokens	 10000000	
> io.sort.mb	 100	
> ipc.client.connection.maxidletime	 1000	
> db.fetch.schedule.adaptive.max_interval	 31536000.0	
> mapred.compress.map.output	 false	
> ipc.client.kill.max	 10	
> urlnormalizer.order
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer	
> ipc.client.connect.max.retries	 10	
> urlfilter.prefix.file	 prefix-urlfilter.txt	
> db.signature.text_profile.quant_rate	 0.01	
> query.type.boost	 0.0	
> fs.s3.impl	 org.apache.hadoop.fs.s3.S3FileSystem	
> mime.type.magic	 true	
> generate.max.per.host	 -1	
> db.fetch.interval.max	 7776000	
> urlnormalizer.loop.count	 1	
> mapred.input.dir
> /user/nutch/crawl20071126/segments/20071126123442/content	
> io.file.buffer.size	 4096	
> db.score.injected	 1.0	
> dfs.replication.considerLoad	 true	
> jobclient.output.filter	 FAILED	
> mapred.tasktracker.tasks.maximum	 2	
> io.compression.codecs
> org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compres
> s.GzipCodec	
> fs.checkpoint.size	 67108864	
>
>
> ________________________________
>
> 	From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org] 
> 	Sent: Monday, November 26, 2007 3:08 PM
> 	To: nutch-user@lucene.apache.org
> 	Subject: Crash in Parser
> 	
> 	
> 	All,
> 	I'm having some trouble with the Nutch nightly.  It has been a
> while since I last updated my crawl of our intranet.  I was attempting
> to run the crawl today and it failed with this:
> 	Exception in thread "main" java.io.IOException: Job failed!
> 	        at
> org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
> 	        at
> org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
> 	        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
> 	
> 	In the web interface it says that:
> 	Task task_200711261211_0026_m_000015_0 failed to report status
> for 602 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_1 failed to report status
> for 601 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_2 failed to report status
> for 601 seconds. Killing!
> 	
> 	Task task_200711261211_0026_m_000015_3 failed to report status
> for 602 seconds. Killing!
> 	 
> 	I don't have the fetchers set to parse.  Nutch and hadoop are
> running on a 3 node cluster.  I've attached the job configuration file
> as saved from the web interface.
> 	 
> 	Is there any way I can get more information on which file or
> url the parse is failing on?  Why doesn't the parsing of a file or URL
> fail more cleanly?
> 	 
> 	Any recommendations on helping nutch avoid whatever is causing
> the hang and allowing it to index the rest of the content?
> 	 
> 	Thanks.
> 	 
> 	 
> 	Jeff Bolle
> 	 
>
>
>   


RE: Crash in Parser

Posted by "Bolle, Jeffrey F." <jb...@mitre.org>.
Apparently the job configuration file didn't make it through the
listserv.  Here it is in the body of the e-mail.
 
Jeff
 

Job Configuration: JobId - job_200711261211_0026


name	 value	
dfs.secondary.info.bindAddress	 0.0.0.0	
dfs.datanode.port	 50010	
dfs.client.buffer.dir	 ${hadoop.tmp.dir}/dfs/tmp	
searcher.summary.length	 20	
generate.update.crawldb	 false	
lang.ngram.max.length	 4	
tasktracker.http.port	 50060	
searcher.filter.cache.size	 16	
ftp.timeout	 60000	
hadoop.tmp.dir	 /tmp/hadoop-${user.name}	
hadoop.native.lib	 true	
map.sort.class	 org.apache.hadoop.mapred.MergeSorter	
ftp.follow.talk	 false	
indexer.mergeFactor	 50	
ipc.client.idlethreshold	 4000	
query.host.boost	 2.0	
mapred.system.dir	 /nutch/filesystem/mapreduce/system	
ftp.password	 anonymous@example.com	
http.agent.version	 Nutch-1.0-dev	
query.tag.boost	 1.0	
dfs.namenode.logging.level	 info	
db.fetch.schedule.adaptive.sync_delta_rate	 0.3	
io.skip.checksum.errors	 false	
urlfilter.automaton.file	 automaton-urlfilter.txt	
fs.default.name	 cisserver:9000	
db.ignore.external.links	 false	
extension.ontology.urls	 	
dfs.safemode.threshold.pct	 0.999f	
dfs.namenode.handler.count	 10	
plugin.folders	 plugins	
mapred.tasktracker.dns.nameserver	 default	
io.sort.factor	 10	
fetcher.threads.per.host.by.ip	 false	
parser.html.impl	 neko	
mapred.task.timeout	 600000	
mapred.max.tracker.failures	 4	
hadoop.rpc.socket.factory.class.default
org.apache.hadoop.net.StandardSocketFactory	
db.update.additions.allowed	 true	
fs.hdfs.impl	 org.apache.hadoop.dfs.DistributedFileSystem	
indexer.score.power	 0.5	
ipc.client.maxidletime	 120000	
db.fetch.schedule.class	 org.apache.nutch.crawl.DefaultFetchSchedule

mapred.output.key.class	 org.apache.hadoop.io.Text	
file.content.limit	 10485760	
http.agent.url	 http://poisk/index.php/Category:Systems	
dfs.safemode.extension	 30000	
tasktracker.http.threads	 40	
db.fetch.schedule.adaptive.dec_rate	 0.2	
user.name	 nutch	
mapred.output.compress	 false	
io.bytes.per.checksum	 512	
fetcher.server.delay	 0.2	
searcher.summary.context	 5	
db.fetch.interval.default	 2592000	
searcher.max.time.tick_count	 -1	
parser.html.form.use_action	 false	
fs.trash.root	 ${hadoop.tmp.dir}/Trash	
mapred.reduce.max.attempts	 4	
fs.ramfs.impl	 org.apache.hadoop.fs.InMemoryFileSystem	
db.score.count.filtered	 false	
fetcher.max.crawl.delay	 30	
dfs.info.port	 50070	
indexer.maxMergeDocs	 2147483647	
mapred.jar
/nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar

fs.s3.buffer.dir	 ${hadoop.tmp.dir}/s3	
dfs.block.size	 67108864	
http.robots.403.allow	 true	
ftp.content.limit	 10485760	
job.end.retry.attempts	 0	
fs.file.impl	 org.apache.hadoop.fs.LocalFileSystem	
query.title.boost	 1.5	
mapred.speculative.execution	 true	
mapred.local.dir.minspacestart	 0	
mapred.output.compression.type	 RECORD	
mime.types.file	 tika-mimetypes.xml	
generate.max.per.host.by.ip	 false	
fetcher.parse	 false	
db.default.fetch.interval	 30	
db.max.outlinks.per.page	 -1	
analysis.common.terms.file	 common-terms.utf8	
mapred.userlog.retain.hours	 24	
dfs.replication.max	 512	
http.redirect.max	 5	
local.cache.size	 10737418240	
mapred.min.split.size	 0	
mapred.map.tasks	 18	
fetcher.threads.fetch	 10	
mapred.child.java.opts	 -Xmx1500m	
mapred.output.value.class	 org.apache.nutch.parse.ParseImpl

http.timeout	 10000	
http.content.limit	 10485760	
dfs.secondary.info.port	 50090	
ipc.server.listen.queue.size	 128	
encodingdetector.charset.min.confidence	 -1	
mapred.inmem.merge.threshold	 1000	
job.end.retry.interval	 30000	
fs.checkpoint.dir	 ${hadoop.tmp.dir}/dfs/namesecondary	
query.url.boost	 4.0	
mapred.reduce.tasks	 6	
db.score.link.external	 1.0	
query.anchor.boost	 2.0	
mapred.userlog.limit.kb	 0	
webinterface.private.actions	 false	
db.max.inlinks	 10000000	
mapred.job.split.file
/nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split

mapred.job.name	 parse crawl20071126/segments/20071126123442	
dfs.datanode.dns.nameserver	 default	
dfs.blockreport.intervalMsec	 3600000	
ftp.username	 anonymous	
db.fetch.schedule.adaptive.inc_rate	 0.4	
searcher.max.hits	 -1	
mapred.map.max.attempts	 4	
urlnormalizer.regex.file	 regex-normalize.xml	
ftp.keep.connection	 false	
searcher.filter.cache.threshold	 0.05	
mapred.job.tracker.handler.count	 10	
dfs.client.block.write.retries	 3	
mapred.input.format.class
org.apache.hadoop.mapred.SequenceFileInputFormat	
http.verbose	 true	
fetcher.threads.per.host	 8	
mapred.tasktracker.expiry.interval	 600000	
mapred.job.tracker.info.bindAddress	 0.0.0.0	
ipc.client.timeout	 60000	
keep.failed.task.files	 false	
mapred.output.format.class
org.apache.nutch.parse.ParseOutputFormat	
mapred.output.compression.codec
org.apache.hadoop.io.compress.DefaultCodec	
io.map.index.skip	 0	
mapred.working.dir	 /user/nutch	
tasktracker.http.bindAddress	 0.0.0.0	
io.seqfile.compression.type	 RECORD	
mapred.reducer.class	 org.apache.nutch.parse.ParseSegment	
lang.analyze.max.length	 2048	
db.fetch.schedule.adaptive.min_interval	 60.0	
http.agent.name	 Jeffcrawler	
dfs.default.chunk.view.size	 32768	
hadoop.logfile.size	 10000000	
dfs.datanode.du.pct	 0.98f	
parser.caching.forbidden.policy	 content	
http.useHttp11	 false	
fs.inmemory.size.mb	 75	
db.fetch.schedule.adaptive.sync_delta	 true	
dfs.datanode.du.reserved	 0	
mapred.job.tracker.info.port	 50030	
plugin.auto-activation	 true	
fs.checkpoint.period	 3600	
mapred.jobtracker.completeuserjobs.maximum	 100	
mapred.task.tracker.report.bindAddress	 127.0.0.1	
db.signature.text_profile.min_token_len	 2	
query.phrase.boost	 1.0	
lang.ngram.min.length	 1	
dfs.df.interval	 60000	
dfs.data.dir	 /nutch/filesystem/data	
dfs.datanode.bindAddress	 0.0.0.0	
fs.s3.maxRetries	 4	
dfs.datanode.dns.interface	 default	
http.agent.email	 Jeff	
extension.clustering.hits-to-cluster	 100	
searcher.max.time.tick_length	 200	
http.agent.description	 Jeff's Crawler	
query.lang.boost	 0.0	
mapred.local.dir	 /nutch/filesystem/mapreduce/local	
fs.hftp.impl	 org.apache.hadoop.dfs.HftpFileSystem	
mapred.mapper.class	 org.apache.nutch.parse.ParseSegment	
fs.trash.interval	 0	
fs.s3.sleepTimeSeconds	 10	
dfs.replication.min	 1	
mapred.submit.replication	 10	
indexer.max.title.length	 100	
parser.character.encoding.default	 windows-1252	
mapred.map.output.compression.codec
org.apache.hadoop.io.compress.DefaultCodec	
mapred.tasktracker.dns.interface	 default	
http.robots.agents	 Jeffcrawler,*	
mapred.job.tracker	 cisserver:9001	
dfs.heartbeat.interval	 3	
urlfilter.regex.file	 crawl-urlfilter.txt	
io.seqfile.sorter.recordlimit	 1000000	
fetcher.store.content	 true	
urlfilter.suffix.file	 suffix-urlfilter.txt	
dfs.name.dir	 /nutch/filesystem/name	
fetcher.verbose	 true	
db.signature.class	 org.apache.nutch.crawl.MD5Signature	
db.max.anchor.length	 100	
parse.plugin.file	 parse-plugins.xml	
nutch.segment.name	 20071126123442	
mapred.local.dir.minspacekill	 0	
searcher.dir	 /var/nutch/crawl	
fs.kfs.impl	 org.apache.hadoop.fs.kfs.KosmosFileSystem	
plugin.includes
protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfi
lter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|s
wf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|
query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-o
pic|subcollection	
mapred.map.output.compression.type	 RECORD	
mapred.temp.dir	 ${hadoop.tmp.dir}/mapred/temp	
db.fetch.retry.max	 3	
query.cc.boost	 0.0	
dfs.replication	 2	
db.ignore.internal.links	 false	
dfs.info.bindAddress	 0.0.0.0	
query.site.boost	 0.0	
searcher.hostgrouping.rawhits.factor	 2.0	
fetcher.server.min.delay	 0.0	
hadoop.logfile.count	 10	
indexer.termIndexInterval	 128	
file.content.ignored	 true	
db.score.link.internal	 1.0	
io.seqfile.compress.blocksize	 1000000	
fs.s3.block.size	 67108864	
ftp.server.timeout	 100000	
http.max.delays	 1000	
indexer.minMergeDocs	 50	
mapred.reduce.parallel.copies	 5	
io.seqfile.lazydecompress	 true	
mapred.output.dir
/user/nutch/crawl20071126/segments/20071126123442	
indexer.max.tokens	 10000000	
io.sort.mb	 100	
ipc.client.connection.maxidletime	 1000	
db.fetch.schedule.adaptive.max_interval	 31536000.0	
mapred.compress.map.output	 false	
ipc.client.kill.max	 10	
urlnormalizer.order
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer	
ipc.client.connect.max.retries	 10	
urlfilter.prefix.file	 prefix-urlfilter.txt	
db.signature.text_profile.quant_rate	 0.01	
query.type.boost	 0.0	
fs.s3.impl	 org.apache.hadoop.fs.s3.S3FileSystem	
mime.type.magic	 true	
generate.max.per.host	 -1	
db.fetch.interval.max	 7776000	
urlnormalizer.loop.count	 1	
mapred.input.dir
/user/nutch/crawl20071126/segments/20071126123442/content	
io.file.buffer.size	 4096	
db.score.injected	 1.0	
dfs.replication.considerLoad	 true	
jobclient.output.filter	 FAILED	
mapred.tasktracker.tasks.maximum	 2	
io.compression.codecs
org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compres
s.GzipCodec	
fs.checkpoint.size	 67108864	


________________________________

	From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org] 
	Sent: Monday, November 26, 2007 3:08 PM
	To: nutch-user@lucene.apache.org
	Subject: Crash in Parser
	
	
	All,
	I'm having some trouble with the Nutch nightly.  It has been a
while since I last updated my crawl of our intranet.  I was attempting
to run the crawl today and it failed with this:
	Exception in thread "main" java.io.IOException: Job failed!
	        at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
	        at
org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
	        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
	
	In the web interface it says that:
	Task task_200711261211_0026_m_000015_0 failed to report status
for 602 seconds. Killing!
	
	Task task_200711261211_0026_m_000015_1 failed to report status
for 601 seconds. Killing!
	
	Task task_200711261211_0026_m_000015_2 failed to report status
for 601 seconds. Killing!
	
	Task task_200711261211_0026_m_000015_3 failed to report status
for 602 seconds. Killing!
	 
	I don't have the fetchers set to parse.  Nutch and hadoop are
running on a 3 node cluster.  I've attached the job configuration file
as saved from the web interface.
	 
	Is there any way I can get more information on which file or
url the parse is failing on?  Why doesn't the parsing of a file or URL
fail more cleanly?
	 
	Any recommendations on helping nutch avoid whatever is causing
the hang and allowing it to index the rest of the content?
	 
	Thanks.
	 
	 
	Jeff Bolle