Posted to user@nutch.apache.org by "Bolle, Jeffrey F." <jb...@mitre.org> on 2007/11/26 21:08:00 UTC
Crash in Parser
All,
I'm having some trouble with the Nutch nightly. It has been a while
since I last updated my crawl of our intranet. I was attempting to run
the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
In the web interface it says that:
Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
I don't have the fetchers set to parse. Nutch and hadoop are running
on a 3 node cluster. I've attached the job configuration file as saved
from the web interface.
Is there any way I can get more information on which file or url the
parse is failing on? Why doesn't the parsing of a file or URL fail
more cleanly?
Any recommendations on helping nutch avoid whatever is causing the hang
and allowing it to index the rest of the content?
Thanks.
Jeff Bolle
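The 601-602 second kills line up with the job's mapred.task.timeout of 600000 ms (visible in the configuration quoted later in this thread). One stopgap, sketched here with an assumed 30-minute value, is to raise that timeout in hadoop-site.xml so slow parses get more time before the tasktracker kills them:

    <!-- hadoop-site.xml: milliseconds a task may go without reporting
         status before it is killed; 1800000 ms = 30 minutes (assumed value) -->
    <property>
      <name>mapred.task.timeout</name>
      <value>1800000</value>
    </property>

This only buys time; it does not address whatever makes a single document parse slowly in the first place.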
Newbie question: fetching specific files only.
Posted by "Jose C. Lacal" <Jo...@OpenPHI.com>.
Dear all:
First of all, I am impressed with Nutch's capabilities. In less than 24
hours of work I have a nice system up and running, doing what I thought
would have taken me months to build. Congrats to the community members.
I have RTFM, the tutorials, and the lists. This may be a regex question
more than a Nutch issue. Yet here's the newbie question:
a.) I need to crawl a particular website where the files of interest are
all named as follows: PPPxxxxxxxx ('PPP' followed by 8 digits)
b.) The files are stored under
./show/PPPxxxxxxxx
./show/record/PPPxxxxxxxx
./show/locn/PPPxxxxxxxx
./show/related/PPPxxxxxxxx
After RTFM, I have tried the following with no success:
* regex-urlfilter.txt (+^http://*.*/show/)
* URLs file (http://*.*/show/)
Any pointers appreciated. Thanks.
--
José C. Lacal, Founder & Chief Vision Officer
Open Personalized Health Informatics _OpenPHI
15625 NW 15th Avenue; Suite 15
Miami, FL 33169-5601 USA www.OpenPHI.com
+1 (954) 553-1984 Jose.Lacal@OpenPHI.com
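A sketch of what usually works here, assuming the site is (say) www.example.com: regex-urlfilter.txt takes Java regular expressions, so * is a quantifier rather than a shell-style wildcard, and the URLs seed file must list concrete starting URLs rather than patterns. For example:

    # regex-urlfilter.txt -- hypothetical host www.example.com
    # accept the four /show/ paths whose last segment is PPP plus 8 digits
    +^http://www\.example\.com/show/((record|locn|related)/)?PPP[0-9]{8}$
    # reject everything else
    -.

    # urls/seed.txt -- a concrete page to start crawling from, not a pattern
    http://www.example.com/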
Re: Crash in Parser
Posted by Karol Rybak <ka...@gmail.com>.
Hello, I had the same issue. It seems there's a problem with the NekoHTML
parser; I solved it by using the TagSoup parser.
--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology
and Management
+48(17)8661277
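The HTML parser implementation is chosen by the parser.html.impl property (the job configuration quoted later in this thread shows it set to neko). A minimal nutch-site.xml override to try TagSoup instead might look like:

    <!-- nutch-site.xml: have the parse-html plugin use TagSoup
         instead of NekoHTML -->
    <property>
      <name>parser.html.impl</name>
      <value>tagsoup</value>
    </property>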
RE: Crash in Parser
Posted by "Bolle, Jeffrey F." <jb...@mitre.org>.
Ned,
Thanks for the hint; I found the advice about using kill -s SIGQUIT in
an earlier post. Luckily, I saw the hung thread on the machine and
managed to get the command in before Nutch killed it.
It doesn't appear that I am stuck in the regexp. I ran the command a few
times; here are the last two iterations:
2007-11-27 17:46:29
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10 tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@1b15692" daemon prio=10 tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon prio=10 tid=0x52203800 nid=0x33cc in Object.wait() [0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10 tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait() [0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable [0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at java.util.Arrays.copyOf(Arrays.java:2882)
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
        at java.lang.StringBuffer.append(StringBuffer.java:224)
        - locked <0xab27c430> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0 runnable

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on condition

JNI global references: 1196

Heap
 PSYoungGen      total 159424K, used 155715K [0xaa7a0000, 0xb4e40000, 0xb4e40000)
  eden space 148224K, 97% used [0xaa7a0000,0xb34ccfa0,0xb3860000)
  from space 11200K, 99% used [0xb3860000,0xb4343f40,0xb4350000)
  to   space 11200K, 0% used [0xb4350000,0xb4350000,0xb4e40000)
 PSOldGen        total 369088K, used 120964K [0x57240000, 0x6dab0000, 0xaa7a0000)
  object space 369088K, 32% used [0x57240000,0x5e861268,0x6dab0000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000, 0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)
2007-11-27 17:47:25
Full thread dump Java HotSpot(TM) Server VM (1.6.0_01-b06 mixed mode):

"Comm thread for task_200711270828_0031_m_000016_1" daemon prio=10 tid=0x52229c00 nid=0x33ce waiting on condition [0x5209a000..0x5209afb0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Task$1.run(Task.java:281)
        at java.lang.Thread.run(Thread.java:619)

"org.apache.hadoop.dfs.DFSClient$LeaseChecker@1b15692" daemon prio=10 tid=0x52231800 nid=0x33cd waiting on condition [0x520ec000..0x520ec130]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:558)
        at java.lang.Thread.run(Thread.java:619)

"IPC Client connection to cisserver/192.168.100.215:9000" daemon prio=10 tid=0x52203800 nid=0x33cc in Object.wait() [0x5213c000..0x5213d0b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572c5fe8> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"IPC Client connection to /127.0.0.1:51728" daemon prio=10 tid=0x52238c00 nid=0x33cb in Object.wait() [0x521a0000..0x521a0e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at java.lang.Object.wait(Object.java:485)
        at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:216)
        - locked <0x572ca5e0> (a org.apache.hadoop.ipc.Client$Connection)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:255)

"org.apache.hadoop.io.ObjectWritable Connection Culler" daemon prio=10 tid=0x52235000 nid=0x33ca waiting on condition [0x521f1000..0x521f1db0]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.java:404)

"Low Memory Detector" daemon prio=10 tid=0x08ad9400 nid=0x33c7 runnable [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x08ad7800 nid=0x33c6 waiting on condition [0x00000000..0x525d2688]
   java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x08ad6400 nid=0x33c5 waiting on condition [0x00000000..0x526535c8]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x08ad5000 nid=0x33c4 waiting on condition [0x00000000..0x00000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x08ac2000 nid=0x33c3 in Object.wait() [0x528f5000..0x528f60b0]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x5727ddc8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x08ac1400 nid=0x33c2 in Object.wait() [0x52946000..0x52946e30]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x5727dde8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x5727dde8> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x089fd000 nid=0x33be runnable [0xb7fab000..0xb7fac208]
   java.lang.Thread.State: RUNNABLE
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.apache.nutch.parse.html.DOMBuilder.characters(DOMBuilder.java:405)
        at org.ccil.cowan.tagsoup.Parser.pcdata(Parser.java:461)
        at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:451)
        at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:210)
        at org.apache.nutch.parse.html.HtmlParser.parseTagSoup(HtmlParser.java:222)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:209)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:75)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)

"VM Thread" prio=10 tid=0x08abe800 nid=0x33c1 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x08a03c00 nid=0x33bf runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x08a04c00 nid=0x33c0 runnable

"VM Periodic Task Thread" prio=10 tid=0x08adac00 nid=0x33c8 waiting on condition

JNI global references: 1196

Heap
 PSYoungGen      total 159104K, used 137234K [0xaa7a0000, 0xb4e40000, 0xb4e40000)
  eden space 147584K, 85% used [0xaa7a0000,0xb2272700,0xb37c0000)
  from space 11520K, 99% used [0xb4300000,0xb4e32480,0xb4e40000)
  to   space 11520K, 0% used [0xb37c0000,0xb37c0000,0xb4300000)
 PSOldGen        total 412672K, used 132811K [0x57240000, 0x70540000, 0xaa7a0000)
  object space 412672K, 32% used [0x57240000,0x5f3f2f78,0x70540000)
 PSPermGen       total 16384K, used 8760K [0x53240000, 0x54240000, 0x57240000)
  object space 16384K, 53% used [0x53240000,0x53ace250,0x54240000)
For a long time it sat in java.util.Arrays.copyOf, but it does appear
to have eventually returned from that. I think my problem may lie more
in making sure the task JVMs have the necessary memory and enough time
to parse larger documents (around 10MB). Even so, it is frustrating
that the failure to parse one document kills the whole parse job. Is
there a way to make this more granular at the document level, so that a
task could return whatever has already been parsed before the job
hangs, times out, or throws an exception?
Thanks.
Jeff
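A rough illustration of that document-level granularity (the ParseGuard class and its timeout are hypothetical, not existing Nutch code): run each parse on a worker thread and abandon it at a deadline, so one pathological document cannot stall the whole map task.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical sketch: bound the time any single document may spend
    // in a parser, instead of letting one document hang the map task.
    public class ParseGuard {
        private static final ExecutorService POOL = Executors.newCachedThreadPool();

        public static <T> T callWithTimeout(Callable<T> parseCall, long seconds)
                throws Exception {
            Future<T> future = POOL.submit(parseCall);
            try {
                return future.get(seconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Best effort: a thread spinning inside a regex may ignore the
                // interrupt, but the map task can still record a failed parse
                // and move on to the next document.
                future.cancel(true);
                return null;
            }
        }
    }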
-----Original Message-----
From: Ned Rockson [mailto:ned@discoveryengine.com]
Sent: Tuesday, November 27, 2007 2:25 PM
To: nutch-user@lucene.apache.org
Subject: Re: Crash in Parser
This is a problem with regex parsing. It has happened for me in the
urlnormalizer, where a URL was parsed incorrectly and for some reason
was extremely long or contained control characters. What happens is
that if the URL is really long (say thousands of characters) it goes
into a very inefficient algorithm (I believe O(n^3), but I'm not sure)
to find certain features. I fixed this by having prefix-urlnormalizer
first check that the length of the URL is less than some constant
(which I have defined as 1024). I also saw this problem the other day
with the .js parser. Essentially there was a page,
http://www.magic-cadeaux.fr/, that had a JavaScript line consisting of
150000 slashes in a row. It parses fine in a browser, but again it led
to an endless regex loop.
If you find these are your problems, you can find the stuck task and do
a kill -SIGQUIT, which will dump stack traces to stdout (redirected to
logs/userlogs/[task name]/stdout), and check whether it's stuck in a
regex loop and what put it there.
--Ned
Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv. Here it is in the body of the e-mail.
>
> Jeff
>
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name value
> dfs.secondary.info.bindAddress 0.0.0.0
> dfs.datanode.port 50010
> dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
> searcher.summary.length 20
> generate.update.crawldb false
> lang.ngram.max.length 4
> tasktracker.http.port 50060
> searcher.filter.cache.size 16
> ftp.timeout 60000
> hadoop.tmp.dir /tmp/hadoop-${user.name}
> hadoop.native.lib true
> map.sort.class org.apache.hadoop.mapred.MergeSorter
> ftp.follow.talk false
> indexer.mergeFactor 50
> ipc.client.idlethreshold 4000
> query.host.boost 2.0
> mapred.system.dir /nutch/filesystem/mapreduce/system
> ftp.password anonymous@example.com
> http.agent.version Nutch-1.0-dev
> query.tag.boost 1.0
> dfs.namenode.logging.level info
> db.fetch.schedule.adaptive.sync_delta_rate 0.3
> io.skip.checksum.errors false
> urlfilter.automaton.file automaton-urlfilter.txt
> fs.default.name cisserver:9000
> db.ignore.external.links false
> extension.ontology.urls
> dfs.safemode.threshold.pct 0.999f
> dfs.namenode.handler.count 10
> plugin.folders plugins
> mapred.tasktracker.dns.nameserver default
> io.sort.factor 10
> fetcher.threads.per.host.by.ip false
> parser.html.impl neko
> mapred.task.timeout 600000
> mapred.max.tracker.failures 4
> hadoop.rpc.socket.factory.class.default
> org.apache.hadoop.net.StandardSocketFactory
> db.update.additions.allowed true
> fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
> indexer.score.power 0.5
> ipc.client.maxidletime 120000
> db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class org.apache.hadoop.io.Text
> file.content.limit 10485760
> http.agent.url http://poisk/index.php/Category:Systems
> dfs.safemode.extension 30000
> tasktracker.http.threads 40
> db.fetch.schedule.adaptive.dec_rate 0.2
> user.name nutch
> mapred.output.compress false
> io.bytes.per.checksum 512
> fetcher.server.delay 0.2
> searcher.summary.context 5
> db.fetch.interval.default 2592000
> searcher.max.time.tick_count -1
> parser.html.form.use_action false
> fs.trash.root ${hadoop.tmp.dir}/Trash
> mapred.reduce.max.attempts 4
> fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
> db.score.count.filtered false
> fetcher.max.crawl.delay 30
> dfs.info.port 50070
> indexer.maxMergeDocs 2147483647
> mapred.jar
> /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
> dfs.block.size 67108864
> http.robots.403.allow true
> ftp.content.limit 10485760
> job.end.retry.attempts 0
> fs.file.impl org.apache.hadoop.fs.LocalFileSystem
> query.title.boost 1.5
> mapred.speculative.execution true
> mapred.local.dir.minspacestart 0
> mapred.output.compression.type RECORD
> mime.types.file tika-mimetypes.xml
> generate.max.per.host.by.ip false
> fetcher.parse false
> db.default.fetch.interval 30
> db.max.outlinks.per.page -1
> analysis.common.terms.file common-terms.utf8
> mapred.userlog.retain.hours 24
> dfs.replication.max 512
> http.redirect.max 5
> local.cache.size 10737418240
> mapred.min.split.size 0
> mapred.map.tasks 18
> fetcher.threads.fetch 10
> mapred.child.java.opts -Xmx1500m
> mapred.output.value.class org.apache.nutch.parse.ParseImpl
>
> http.timeout 10000
> http.content.limit 10485760
> dfs.secondary.info.port 50090
> ipc.server.listen.queue.size 128
> encodingdetector.charset.min.confidence -1
> mapred.inmem.merge.threshold 1000
> job.end.retry.interval 30000
> fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
> query.url.boost 4.0
> mapred.reduce.tasks 6
> db.score.link.external 1.0
> query.anchor.boost 2.0
> mapred.userlog.limit.kb 0
> webinterface.private.actions false
> db.max.inlinks 10000000
> mapred.job.split.file
> /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name parse crawl20071126/segments/20071126123442
> dfs.datanode.dns.nameserver default
> dfs.blockreport.intervalMsec 3600000
> ftp.username anonymous
> db.fetch.schedule.adaptive.inc_rate 0.4
> searcher.max.hits -1
> mapred.map.max.attempts 4
> urlnormalizer.regex.file regex-normalize.xml
> ftp.keep.connection false
> searcher.filter.cache.threshold 0.05
> mapred.job.tracker.handler.count 10
> dfs.client.block.write.retries 3
> mapred.input.format.class
> org.apache.hadoop.mapred.SequenceFileInputFormat
> http.verbose true
> fetcher.threads.per.host 8
> mapred.tasktracker.expiry.interval 600000
> mapred.job.tracker.info.bindAddress 0.0.0.0
> ipc.client.timeout 60000
> keep.failed.task.files false
> mapred.output.format.class
> org.apache.nutch.parse.ParseOutputFormat
> mapred.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec
> io.map.index.skip 0
> mapred.working.dir /user/nutch
> tasktracker.http.bindAddress 0.0.0.0
> io.seqfile.compression.type RECORD
> mapred.reducer.class org.apache.nutch.parse.ParseSegment
> lang.analyze.max.length 2048
> db.fetch.schedule.adaptive.min_interval 60.0
> http.agent.name Jeffcrawler
> dfs.default.chunk.view.size 32768
> hadoop.logfile.size 10000000
> dfs.datanode.du.pct 0.98f
> parser.caching.forbidden.policy content
> http.useHttp11 false
> fs.inmemory.size.mb 75
> db.fetch.schedule.adaptive.sync_delta true
> dfs.datanode.du.reserved 0
> mapred.job.tracker.info.port 50030
> plugin.auto-activation true
> fs.checkpoint.period 3600
> mapred.jobtracker.completeuserjobs.maximum 100
> mapred.task.tracker.report.bindAddress 127.0.0.1
> db.signature.text_profile.min_token_len 2
> query.phrase.boost 1.0
> lang.ngram.min.length 1
> dfs.df.interval 60000
> dfs.data.dir /nutch/filesystem/data
> dfs.datanode.bindAddress 0.0.0.0
> fs.s3.maxRetries 4
> dfs.datanode.dns.interface default
> http.agent.email Jeff
> extension.clustering.hits-to-cluster 100
> searcher.max.time.tick_length 200
> http.agent.description Jeff's Crawler
> query.lang.boost 0.0
> mapred.local.dir /nutch/filesystem/mapreduce/local
> fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
> mapred.mapper.class org.apache.nutch.parse.ParseSegment
> fs.trash.interval 0
> fs.s3.sleepTimeSeconds 10
> dfs.replication.min 1
> mapred.submit.replication 10
> indexer.max.title.length 100
> parser.character.encoding.default windows-1252
> mapred.map.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec
> mapred.tasktracker.dns.interface default
> http.robots.agents Jeffcrawler,*
> mapred.job.tracker cisserver:9001
> dfs.heartbeat.interval 3
> urlfilter.regex.file crawl-urlfilter.txt
> io.seqfile.sorter.recordlimit 1000000
> fetcher.store.content true
> urlfilter.suffix.file suffix-urlfilter.txt
> dfs.name.dir /nutch/filesystem/name
> fetcher.verbose true
> db.signature.class org.apache.nutch.crawl.MD5Signature
> db.max.anchor.length 100
> parse.plugin.file parse-plugins.xml
> nutch.segment.name 20071126123442
> mapred.local.dir.minspacekill 0
> searcher.dir /var/nutch/crawl
> fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
> plugin.includes
> protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
> mapred.map.output.compression.type RECORD
> mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
> db.fetch.retry.max 3
> query.cc.boost 0.0
> dfs.replication 2
> db.ignore.internal.links false
> dfs.info.bindAddress 0.0.0.0
> query.site.boost 0.0
> searcher.hostgrouping.rawhits.factor 2.0
> fetcher.server.min.delay 0.0
> hadoop.logfile.count 10
> indexer.termIndexInterval 128
> file.content.ignored true
> db.score.link.internal 1.0
> io.seqfile.compress.blocksize 1000000
> fs.s3.block.size 67108864
> ftp.server.timeout 100000
> http.max.delays 1000
> indexer.minMergeDocs 50
> mapred.reduce.parallel.copies 5
> io.seqfile.lazydecompress true
> mapred.output.dir
> /user/nutch/crawl20071126/segments/20071126123442
> indexer.max.tokens 10000000
> io.sort.mb 100
> ipc.client.connection.maxidletime 1000
> db.fetch.schedule.adaptive.max_interval 31536000.0
> mapred.compress.map.output false
> ipc.client.kill.max 10
> urlnormalizer.order
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> ipc.client.connect.max.retries 10
> urlfilter.prefix.file prefix-urlfilter.txt
> db.signature.text_profile.quant_rate 0.01
> query.type.boost 0.0
> fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
> mime.type.magic true
> generate.max.per.host -1
> db.fetch.interval.max 7776000
> urlnormalizer.loop.count 1
> mapred.input.dir
> /user/nutch/crawl20071126/segments/20071126123442/content
> io.file.buffer.size 4096
> db.score.injected 1.0
> dfs.replication.considerLoad true
> jobclient.output.filter FAILED
> mapred.tasktracker.tasks.maximum 2
> io.compression.codecs
> org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
> fs.checkpoint.size 67108864
>
>
> ________________________________
>
> From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org]
> Sent: Monday, November 26, 2007 3:08 PM
> To: nutch-user@lucene.apache.org
> Subject: Crash in Parser
>
>
> All,
> I'm having some trouble with the Nutch nightly. It has been a
> while since I last updated my crawl of our intranet. I was attempting
> to run the crawl today and it failed with this:
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
>
> In the web interface it says that:
> Task task_200711261211_0026_m_000015_0 failed to report status
> for 602 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_1 failed to report status
> for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_2 failed to report status
> for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_3 failed to report status
> for 602 seconds. Killing!
>
> I don't have the fetchers set to parse. Nutch and hadoop are
> running on a 3 node cluster. I've attached the job configuration file
> as saved from the web interface.
>
> Is there any way I can get more information on which file or
> url the parse is failing on? Why doesn't the parsing of a file or URL
> fail more cleanly?
>
> Any recommendations on helping nutch avoid whatever is causing
> the hang and allowing it to index the rest of the content?
>
> Thanks.
>
>
> Jeff Bolle
>
>
>
>
Re: Crash in Parser
Posted by Ned Rockson <ne...@discoveryengine.com>.
This is a problem with regex parsing. It has happened for me in the
urlnormalizer, where a URL was parsed incorrectly and for some reason
was extremely long or contained control characters. What happens is
that if the URL is really long (say thousands of characters) it goes
into a very inefficient algorithm (I believe O(n^3), but I'm not sure)
to find certain features. I fixed this by having prefix-urlnormalizer
first check that the length of the URL is less than some constant
(which I have defined as 1024). I also saw this problem the other day
with the .js parser. Essentially there was a page,
http://www.magic-cadeaux.fr/, that had a JavaScript line consisting of
150000 slashes in a row. It parses fine in a browser, but again it led
to an endless regex loop.
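As an illustration only (this class and its 1024-character limit are a
sketch of the guard described above, not actual Nutch code), the check
can sit in front of the regex pass:

    // Hypothetical sketch: skip regex-based normalization for absurdly
    // long URLs so catastrophic backtracking never gets a chance to run.
    public class LengthGuardedNormalizer {
        private static final int MAX_URL_LENGTH = 1024; // assumed limit

        public String normalize(String url) {
            if (url == null || url.length() > MAX_URL_LENGTH) {
                return null; // drop the URL instead of feeding it to the regexes
            }
            return regexNormalize(url);
        }

        private String regexNormalize(String url) {
            // placeholder for the real regex-based normalization rules
            return url;
        }
    }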
If you find these are your problems, you can find the stuck task and do
a kill -SIGQUIT, which will dump stack traces to stdout (redirected to
logs/userlogs/[task name]/stdout), and check whether it's stuck in a
regex loop and what put it there.
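For example, assuming the stuck child JVM can be identified with jps:

    jps -l             # list running JVMs; the task shows up as TaskTracker$Child
    kill -QUIT <pid>   # same as kill -s SIGQUIT; the JVM prints a full thread dump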
--Ned
Bolle, Jeffrey F. wrote:
> Apparently the job configuration file didn't make it through the
> listserv. Here it is in the body of the e-mail.
>
> Jeff
>
>
> Job Configuration: JobId - job_200711261211_0026
>
>
> name value
> dfs.secondary.info.bindAddress 0.0.0.0
> dfs.datanode.port 50010
> dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
> searcher.summary.length 20
> generate.update.crawldb false
> lang.ngram.max.length 4
> tasktracker.http.port 50060
> searcher.filter.cache.size 16
> ftp.timeout 60000
> hadoop.tmp.dir /tmp/hadoop-${user.name}
> hadoop.native.lib true
> map.sort.class org.apache.hadoop.mapred.MergeSorter
> ftp.follow.talk false
> indexer.mergeFactor 50
> ipc.client.idlethreshold 4000
> query.host.boost 2.0
> mapred.system.dir /nutch/filesystem/mapreduce/system
> ftp.password anonymous@example.com
> http.agent.version Nutch-1.0-dev
> query.tag.boost 1.0
> dfs.namenode.logging.level info
> db.fetch.schedule.adaptive.sync_delta_rate 0.3
> io.skip.checksum.errors false
> urlfilter.automaton.file automaton-urlfilter.txt
> fs.default.name cisserver:9000
> db.ignore.external.links false
> extension.ontology.urls
> dfs.safemode.threshold.pct 0.999f
> dfs.namenode.handler.count 10
> plugin.folders plugins
> mapred.tasktracker.dns.nameserver default
> io.sort.factor 10
> fetcher.threads.per.host.by.ip false
> parser.html.impl neko
> mapred.task.timeout 600000
> mapred.max.tracker.failures 4
> hadoop.rpc.socket.factory.class.default
> org.apache.hadoop.net.StandardSocketFactory
> db.update.additions.allowed true
> fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
> indexer.score.power 0.5
> ipc.client.maxidletime 120000
> db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
>
> mapred.output.key.class org.apache.hadoop.io.Text
> file.content.limit 10485760
> http.agent.url http://poisk/index.php/Category:Systems
> dfs.safemode.extension 30000
> tasktracker.http.threads 40
> db.fetch.schedule.adaptive.dec_rate 0.2
> user.name nutch
> mapred.output.compress false
> io.bytes.per.checksum 512
> fetcher.server.delay 0.2
> searcher.summary.context 5
> db.fetch.interval.default 2592000
> searcher.max.time.tick_count -1
> parser.html.form.use_action false
> fs.trash.root ${hadoop.tmp.dir}/Trash
> mapred.reduce.max.attempts 4
> fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
> db.score.count.filtered false
> fetcher.max.crawl.delay 30
> dfs.info.port 50070
> indexer.maxMergeDocs 2147483647
> mapred.jar
> /nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
>
> fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
> dfs.block.size 67108864
> http.robots.403.allow true
> ftp.content.limit 10485760
> job.end.retry.attempts 0
> fs.file.impl org.apache.hadoop.fs.LocalFileSystem
> query.title.boost 1.5
> mapred.speculative.execution true
> mapred.local.dir.minspacestart 0
> mapred.output.compression.type RECORD
> mime.types.file tika-mimetypes.xml
> generate.max.per.host.by.ip false
> fetcher.parse false
> db.default.fetch.interval 30
> db.max.outlinks.per.page -1
> analysis.common.terms.file common-terms.utf8
> mapred.userlog.retain.hours 24
> dfs.replication.max 512
> http.redirect.max 5
> local.cache.size 10737418240
> mapred.min.split.size 0
> mapred.map.tasks 18
> fetcher.threads.fetch 10
> mapred.child.java.opts -Xmx1500m
> mapred.output.value.class org.apache.nutch.parse.ParseImpl
>
> http.timeout 10000
> http.content.limit 10485760
> dfs.secondary.info.port 50090
> ipc.server.listen.queue.size 128
> encodingdetector.charset.min.confidence -1
> mapred.inmem.merge.threshold 1000
> job.end.retry.interval 30000
> fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
> query.url.boost 4.0
> mapred.reduce.tasks 6
> db.score.link.external 1.0
> query.anchor.boost 2.0
> mapred.userlog.limit.kb 0
> webinterface.private.actions false
> db.max.inlinks 10000000
> mapred.job.split.file
> /nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
>
> mapred.job.name parse crawl20071126/segments/20071126123442
> dfs.datanode.dns.nameserver default
> dfs.blockreport.intervalMsec 3600000
> ftp.username anonymous
> db.fetch.schedule.adaptive.inc_rate 0.4
> searcher.max.hits -1
> mapred.map.max.attempts 4
> urlnormalizer.regex.file regex-normalize.xml
> ftp.keep.connection false
> searcher.filter.cache.threshold 0.05
> mapred.job.tracker.handler.count 10
> dfs.client.block.write.retries 3
> mapred.input.format.class
> org.apache.hadoop.mapred.SequenceFileInputFormat
> http.verbose true
> fetcher.threads.per.host 8
> mapred.tasktracker.expiry.interval 600000
> mapred.job.tracker.info.bindAddress 0.0.0.0
> ipc.client.timeout 60000
> keep.failed.task.files false
> mapred.output.format.class
> org.apache.nutch.parse.ParseOutputFormat
> mapred.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec
> io.map.index.skip 0
> mapred.working.dir /user/nutch
> tasktracker.http.bindAddress 0.0.0.0
> io.seqfile.compression.type RECORD
> mapred.reducer.class org.apache.nutch.parse.ParseSegment
> lang.analyze.max.length 2048
> db.fetch.schedule.adaptive.min_interval 60.0
> http.agent.name Jeffcrawler
> dfs.default.chunk.view.size 32768
> hadoop.logfile.size 10000000
> dfs.datanode.du.pct 0.98f
> parser.caching.forbidden.policy content
> http.useHttp11 false
> fs.inmemory.size.mb 75
> db.fetch.schedule.adaptive.sync_delta true
> dfs.datanode.du.reserved 0
> mapred.job.tracker.info.port 50030
> plugin.auto-activation true
> fs.checkpoint.period 3600
> mapred.jobtracker.completeuserjobs.maximum 100
> mapred.task.tracker.report.bindAddress 127.0.0.1
> db.signature.text_profile.min_token_len 2
> query.phrase.boost 1.0
> lang.ngram.min.length 1
> dfs.df.interval 60000
> dfs.data.dir /nutch/filesystem/data
> dfs.datanode.bindAddress 0.0.0.0
> fs.s3.maxRetries 4
> dfs.datanode.dns.interface default
> http.agent.email Jeff
> extension.clustering.hits-to-cluster 100
> searcher.max.time.tick_length 200
> http.agent.description Jeff's Crawler
> query.lang.boost 0.0
> mapred.local.dir /nutch/filesystem/mapreduce/local
> fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
> mapred.mapper.class org.apache.nutch.parse.ParseSegment
> fs.trash.interval 0
> fs.s3.sleepTimeSeconds 10
> dfs.replication.min 1
> mapred.submit.replication 10
> indexer.max.title.length 100
> parser.character.encoding.default windows-1252
> mapred.map.output.compression.codec
> org.apache.hadoop.io.compress.DefaultCodec
> mapred.tasktracker.dns.interface default
> http.robots.agents Jeffcrawler,*
> mapred.job.tracker cisserver:9001
> dfs.heartbeat.interval 3
> urlfilter.regex.file crawl-urlfilter.txt
> io.seqfile.sorter.recordlimit 1000000
> fetcher.store.content true
> urlfilter.suffix.file suffix-urlfilter.txt
> dfs.name.dir /nutch/filesystem/name
> fetcher.verbose true
> db.signature.class org.apache.nutch.crawl.MD5Signature
> db.max.anchor.length 100
> parse.plugin.file parse-plugins.xml
> nutch.segment.name 20071126123442
> mapred.local.dir.minspacekill 0
> searcher.dir /var/nutch/crawl
> fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
> plugin.includes
> protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
> mapred.map.output.compression.type RECORD
> mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
> db.fetch.retry.max 3
> query.cc.boost 0.0
> dfs.replication 2
> db.ignore.internal.links false
> dfs.info.bindAddress 0.0.0.0
> query.site.boost 0.0
> searcher.hostgrouping.rawhits.factor 2.0
> fetcher.server.min.delay 0.0
> hadoop.logfile.count 10
> indexer.termIndexInterval 128
> file.content.ignored true
> db.score.link.internal 1.0
> io.seqfile.compress.blocksize 1000000
> fs.s3.block.size 67108864
> ftp.server.timeout 100000
> http.max.delays 1000
> indexer.minMergeDocs 50
> mapred.reduce.parallel.copies 5
> io.seqfile.lazydecompress true
> mapred.output.dir
> /user/nutch/crawl20071126/segments/20071126123442
> indexer.max.tokens 10000000
> io.sort.mb 100
> ipc.client.connection.maxidletime 1000
> db.fetch.schedule.adaptive.max_interval 31536000.0
> mapred.compress.map.output false
> ipc.client.kill.max 10
> urlnormalizer.order
> org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
> ipc.client.connect.max.retries 10
> urlfilter.prefix.file prefix-urlfilter.txt
> db.signature.text_profile.quant_rate 0.01
> query.type.boost 0.0
> fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
> mime.type.magic true
> generate.max.per.host -1
> db.fetch.interval.max 7776000
> urlnormalizer.loop.count 1
> mapred.input.dir
> /user/nutch/crawl20071126/segments/20071126123442/content
> io.file.buffer.size 4096
> db.score.injected 1.0
> dfs.replication.considerLoad true
> jobclient.output.filter FAILED
> mapred.tasktracker.tasks.maximum 2
> io.compression.codecs
> org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
> fs.checkpoint.size 67108864
>
>
> ________________________________
>
> From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org]
> Sent: Monday, November 26, 2007 3:08 PM
> To: nutch-user@lucene.apache.org
> Subject: Crash in Parser
>
>
> All,
> I'm having some trouble with the Nutch nightly. It has been a
> while since I last updated my crawl of our intranet. I was attempting
> to run the crawl today and it failed with this:
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
>         at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
>
> In the web interface it says that:
> Task task_200711261211_0026_m_000015_0 failed to report status
> for 602 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_1 failed to report status
> for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_2 failed to report status
> for 601 seconds. Killing!
>
> Task task_200711261211_0026_m_000015_3 failed to report status
> for 602 seconds. Killing!
>
> I don't have the fetchers set to parse. Nutch and hadoop are
> running on a 3 node cluster. I've attached the job configuration file
> as saved from the web interface.
>
> Is there any way I can get more information on which file or
> url the parse is failing on? Why doesn't the parsing of a file or URL
> fail more cleanly?
>
> Any recommendations on helping nutch avoid whatever is causing
> the hang and allowing it to index the rest of the content?
>
> Thanks.
>
>
> Jeff Bolle
>
>
>
>
RE: Crash in Parser
Posted by "Bolle, Jeffrey F." <jb...@mitre.org>.
Apparently the job configuration file didn't make it through the
listserv. Here it is in the body of the e-mail.
Jeff
Job Configuration: JobId - job_200711261211_0026
name value
dfs.secondary.info.bindAddress 0.0.0.0
dfs.datanode.port 50010
dfs.client.buffer.dir ${hadoop.tmp.dir}/dfs/tmp
searcher.summary.length 20
generate.update.crawldb false
lang.ngram.max.length 4
tasktracker.http.port 50060
searcher.filter.cache.size 16
ftp.timeout 60000
hadoop.tmp.dir /tmp/hadoop-${user.name}
hadoop.native.lib true
map.sort.class org.apache.hadoop.mapred.MergeSorter
ftp.follow.talk false
indexer.mergeFactor 50
ipc.client.idlethreshold 4000
query.host.boost 2.0
mapred.system.dir /nutch/filesystem/mapreduce/system
ftp.password anonymous@example.com
http.agent.version Nutch-1.0-dev
query.tag.boost 1.0
dfs.namenode.logging.level info
db.fetch.schedule.adaptive.sync_delta_rate 0.3
io.skip.checksum.errors false
urlfilter.automaton.file automaton-urlfilter.txt
fs.default.name cisserver:9000
db.ignore.external.links false
extension.ontology.urls
dfs.safemode.threshold.pct 0.999f
dfs.namenode.handler.count 10
plugin.folders plugins
mapred.tasktracker.dns.nameserver default
io.sort.factor 10
fetcher.threads.per.host.by.ip false
parser.html.impl neko
mapred.task.timeout 600000
mapred.max.tracker.failures 4
hadoop.rpc.socket.factory.class.default
org.apache.hadoop.net.StandardSocketFactory
db.update.additions.allowed true
fs.hdfs.impl org.apache.hadoop.dfs.DistributedFileSystem
indexer.score.power 0.5
ipc.client.maxidletime 120000
db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule
mapred.output.key.class org.apache.hadoop.io.Text
file.content.limit 10485760
http.agent.url http://poisk/index.php/Category:Systems
dfs.safemode.extension 30000
tasktracker.http.threads 40
db.fetch.schedule.adaptive.dec_rate 0.2
user.name nutch
mapred.output.compress false
io.bytes.per.checksum 512
fetcher.server.delay 0.2
searcher.summary.context 5
db.fetch.interval.default 2592000
searcher.max.time.tick_count -1
parser.html.form.use_action false
fs.trash.root ${hadoop.tmp.dir}/Trash
mapred.reduce.max.attempts 4
fs.ramfs.impl org.apache.hadoop.fs.InMemoryFileSystem
db.score.count.filtered false
fetcher.max.crawl.delay 30
dfs.info.port 50070
indexer.maxMergeDocs 2147483647
mapred.jar
/nutch/filesystem/mapreduce/local/jobTracker/job_200711261211_0026.jar
fs.s3.buffer.dir ${hadoop.tmp.dir}/s3
dfs.block.size 67108864
http.robots.403.allow true
ftp.content.limit 10485760
job.end.retry.attempts 0
fs.file.impl org.apache.hadoop.fs.LocalFileSystem
query.title.boost 1.5
mapred.speculative.execution true
mapred.local.dir.minspacestart 0
mapred.output.compression.type RECORD
mime.types.file tika-mimetypes.xml
generate.max.per.host.by.ip false
fetcher.parse false
db.default.fetch.interval 30
db.max.outlinks.per.page -1
analysis.common.terms.file common-terms.utf8
mapred.userlog.retain.hours 24
dfs.replication.max 512
http.redirect.max 5
local.cache.size 10737418240
mapred.min.split.size 0
mapred.map.tasks 18
fetcher.threads.fetch 10
mapred.child.java.opts -Xmx1500m
mapred.output.value.class org.apache.nutch.parse.ParseImpl
http.timeout 10000
http.content.limit 10485760
dfs.secondary.info.port 50090
ipc.server.listen.queue.size 128
encodingdetector.charset.min.confidence -1
mapred.inmem.merge.threshold 1000
job.end.retry.interval 30000
fs.checkpoint.dir ${hadoop.tmp.dir}/dfs/namesecondary
query.url.boost 4.0
mapred.reduce.tasks 6
db.score.link.external 1.0
query.anchor.boost 2.0
mapred.userlog.limit.kb 0
webinterface.private.actions false
db.max.inlinks 10000000
mapred.job.split.file
/nutch/filesystem/mapreduce/system/job_200711261211_0026/job.split
mapred.job.name parse crawl20071126/segments/20071126123442
dfs.datanode.dns.nameserver default
dfs.blockreport.intervalMsec 3600000
ftp.username anonymous
db.fetch.schedule.adaptive.inc_rate 0.4
searcher.max.hits -1
mapred.map.max.attempts 4
urlnormalizer.regex.file regex-normalize.xml
ftp.keep.connection false
searcher.filter.cache.threshold 0.05
mapred.job.tracker.handler.count 10
dfs.client.block.write.retries 3
mapred.input.format.class
org.apache.hadoop.mapred.SequenceFileInputFormat
http.verbose true
fetcher.threads.per.host 8
mapred.tasktracker.expiry.interval 600000
mapred.job.tracker.info.bindAddress 0.0.0.0
ipc.client.timeout 60000
keep.failed.task.files false
mapred.output.format.class
org.apache.nutch.parse.ParseOutputFormat
mapred.output.compression.codec
org.apache.hadoop.io.compress.DefaultCodec
io.map.index.skip 0
mapred.working.dir /user/nutch
tasktracker.http.bindAddress 0.0.0.0
io.seqfile.compression.type RECORD
mapred.reducer.class org.apache.nutch.parse.ParseSegment
lang.analyze.max.length 2048
db.fetch.schedule.adaptive.min_interval 60.0
http.agent.name Jeffcrawler
dfs.default.chunk.view.size 32768
hadoop.logfile.size 10000000
dfs.datanode.du.pct 0.98f
parser.caching.forbidden.policy content
http.useHttp11 false
fs.inmemory.size.mb 75
db.fetch.schedule.adaptive.sync_delta true
dfs.datanode.du.reserved 0
mapred.job.tracker.info.port 50030
plugin.auto-activation true
fs.checkpoint.period 3600
mapred.jobtracker.completeuserjobs.maximum 100
mapred.task.tracker.report.bindAddress 127.0.0.1
db.signature.text_profile.min_token_len 2
query.phrase.boost 1.0
lang.ngram.min.length 1
dfs.df.interval 60000
dfs.data.dir /nutch/filesystem/data
dfs.datanode.bindAddress 0.0.0.0
fs.s3.maxRetries 4
dfs.datanode.dns.interface default
http.agent.email Jeff
extension.clustering.hits-to-cluster 100
searcher.max.time.tick_length 200
http.agent.description Jeff's Crawler
query.lang.boost 0.0
mapred.local.dir /nutch/filesystem/mapreduce/local
fs.hftp.impl org.apache.hadoop.dfs.HftpFileSystem
mapred.mapper.class org.apache.nutch.parse.ParseSegment
fs.trash.interval 0
fs.s3.sleepTimeSeconds 10
dfs.replication.min 1
mapred.submit.replication 10
indexer.max.title.length 100
parser.character.encoding.default windows-1252
mapred.map.output.compression.codec
org.apache.hadoop.io.compress.DefaultCodec
mapred.tasktracker.dns.interface default
http.robots.agents Jeffcrawler,*
mapred.job.tracker cisserver:9001
dfs.heartbeat.interval 3
urlfilter.regex.file crawl-urlfilter.txt
io.seqfile.sorter.recordlimit 1000000
fetcher.store.content true
urlfilter.suffix.file suffix-urlfilter.txt
dfs.name.dir /nutch/filesystem/name
fetcher.verbose true
db.signature.class org.apache.nutch.crawl.MD5Signature
db.max.anchor.length 100
parse.plugin.file parse-plugins.xml
nutch.segment.name 20071126123442
mapred.local.dir.minspacekill 0
searcher.dir /var/nutch/crawl
fs.kfs.impl org.apache.hadoop.fs.kfs.KosmosFileSystem
plugin.includes
protocol-(httpclient|file|ftp)|creativecommons|clustering-carrot2|urlfilter-regex|parse-(html|js|text|pdf|rss|msword|msexcel|mspowerpoint|oo|swf|ext)|index-(basic|more|anchor)|language-identifier|analysis-(en|ru)|query-(basic|more|site|url)|summary-basic|microformats-reltag|scoring-opic|subcollection
mapred.map.output.compression.type RECORD
mapred.temp.dir ${hadoop.tmp.dir}/mapred/temp
db.fetch.retry.max 3
query.cc.boost 0.0
dfs.replication 2
db.ignore.internal.links false
dfs.info.bindAddress 0.0.0.0
query.site.boost 0.0
searcher.hostgrouping.rawhits.factor 2.0
fetcher.server.min.delay 0.0
hadoop.logfile.count 10
indexer.termIndexInterval 128
file.content.ignored true
db.score.link.internal 1.0
io.seqfile.compress.blocksize 1000000
fs.s3.block.size 67108864
ftp.server.timeout 100000
http.max.delays 1000
indexer.minMergeDocs 50
mapred.reduce.parallel.copies 5
io.seqfile.lazydecompress true
mapred.output.dir
/user/nutch/crawl20071126/segments/20071126123442
indexer.max.tokens 10000000
io.sort.mb 100
ipc.client.connection.maxidletime 1000
db.fetch.schedule.adaptive.max_interval 31536000.0
mapred.compress.map.output false
ipc.client.kill.max 10
urlnormalizer.order
org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
ipc.client.connect.max.retries 10
urlfilter.prefix.file prefix-urlfilter.txt
db.signature.text_profile.quant_rate 0.01
query.type.boost 0.0
fs.s3.impl org.apache.hadoop.fs.s3.S3FileSystem
mime.type.magic true
generate.max.per.host -1
db.fetch.interval.max 7776000
urlnormalizer.loop.count 1
mapred.input.dir
/user/nutch/crawl20071126/segments/20071126123442/content
io.file.buffer.size 4096
db.score.injected 1.0
dfs.replication.considerLoad true
jobclient.output.filter FAILED
mapred.tasktracker.tasks.maximum 2
io.compression.codecs
org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec
fs.checkpoint.size 67108864
________________________________
From: Bolle, Jeffrey F. [mailto:jbolle@mitre.org]
Sent: Monday, November 26, 2007 3:08 PM
To: nutch-user@lucene.apache.org
Subject: Crash in Parser
All,
I'm having some trouble with the Nutch nightly. It has been a
while since I last updated my crawl of our intranet. I was attempting
to run the crawl today and it failed with this:
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:142)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:126)
In the web interface it says that:
Task task_200711261211_0026_m_000015_0 failed to report status for 602 seconds. Killing!
Task task_200711261211_0026_m_000015_1 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_2 failed to report status for 601 seconds. Killing!
Task task_200711261211_0026_m_000015_3 failed to report status for 602 seconds. Killing!
I don't have the fetchers set to parse. Nutch and hadoop are
running on a 3 node cluster. I've attached the job configuration file
as saved from the web interface.
Is there any way I can get more information on which file or
url the parse is failing on? Why doesn't the parsing of a file or URL
fail more cleanly?
Any recommendations on helping nutch avoid whatever is causing
the hang and allowing it to index the rest of the content?
Thanks.
Jeff Bolle