Posted to dev@nutch.apache.org by "Karsten Dello (JIRA)" <ji...@apache.org> on 2007/01/01 22:42:27 UTC
[jira] Commented: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs
[ http://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461647 ]
Karsten Dello commented on NUTCH-424:
-------------------------------------
Sorry for cloning, but I could not reopen the original issue.
The problem persists with the current stable Nutch version (0.8.1), which uses NekoHTML 0.9.4.
I parse after fetching ("nutch parse"), so I cannot see the URL in the log file,
even though I have set the log level to DEBUG.
Is there a way to get the URL being parsed into the log?
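One way this might be done, as a rough sketch and not Nutch's actual code: log the URL before handing the content to the parser, name the worker thread after the URL so that it shows up in a thread dump, and give up waiting after a timeout. The class GuardedParse, the method parseWithTimeout and the timeout values are hypothetical; the real parse call is stood in for by a plain Callable, and the lambda syntax assumes a newer JVM than the 1.5.0_05 used here.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.logging.Logger;

// Hypothetical helper, not part of Nutch: logs the URL, then runs the parse
// on a separate thread and stops waiting for it after a timeout.
public class GuardedParse {
    private static final Logger LOG = Logger.getLogger(GuardedParse.class.getName());

    // Returns the parse result, or null if the parse did not finish in time.
    public static <T> T parseWithTimeout(String url, Callable<T> parseTask, long timeoutMillis)
            throws Exception {
        // Logged before parsing starts, so the URL is visible even if the parse hangs.
        LOG.info("Parsing " + url);

        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            // The thread name carries the URL, so a thread dump identifies the document.
            Thread t = new Thread(r, "parse-" + url);
            // Daemon thread: a parser stuck in a loop cannot keep the JVM alive.
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(parseTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            LOG.warning("Parse timed out after " + timeoutMillis + " ms: " + url);
            // cancel(true) only interrupts the thread; code that never checks the
            // interrupt flag keeps running in the background.
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // A task that finishes quickly.
        String ok = parseWithTimeout("http://example.com/ok", () -> "parsed", 1000L);
        System.out.println(ok);

        // A task that simulates a hanging parse.
        String hung = parseWithTimeout("http://example.com/hang", () -> {
            Thread.sleep(60000L);
            return "never";
        }, 200L);
        System.out.println(hung);
    }
}
```

With the thread named after the URL, the jstack or SIGQUIT output would point at the hanging document directly, without relying on the log level.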
Anyway, it seems to be exactly the same error.
Here is the output from jstack and "kill -SIGQUIT <pid>":
(1) jstack output
Attaching to process ID 27428, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_05-b05
Thread 27439: (state = BLOCKED)
- java.lang.AbstractStringBuilder.expandCapacity(int) @bci=28, line=99 (Compiled frame)
- java.lang.AbstractStringBuilder.append(java.lang.String) @bci=36, line=393 (Compiled frame)
- java.lang.StringBuffer.append(java.lang.String) @bci=2, line=225 (Compiled frame)
- org.apache.xerces.dom.CharacterDataImpl.appendData(java.lang.String) @bci=59 (Compiled frame)
- org.cyberneko.html.parsers.DOMFragmentParser.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=117, line=463 (Compiled frame)
- org.cyberneko.html.filters.DefaultFilter.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=13, line=195 (Compiled frame)
- org.cyberneko.html.HTMLTagBalancer.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=294, line=821 (Compiled frame)
- org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters() @bci=324, line=1972 (Compiled frame)
- org.cyberneko.html.HTMLScanner$ContentScanner.scan(boolean) @bci=184, line=1775 (Compiled frame)
- org.cyberneko.html.HTMLScanner.scanDocument(boolean) @bci=5, line=789 (Compiled frame)
- org.cyberneko.html.HTMLConfiguration.parse(org.apache.xerces.xni.parser.XMLInputSource) @bci=7, line=431 (Compiled frame)
- org.cyberneko.html.parsers.DOMFragmentParser.parse(org.xml.sax.InputSource, org.w3c.dom.DocumentFragment) @bci=93, line=164 (Compiled frame)
- org.apache.nutch.parse.html.HtmlParser.parseNeko(org.xml.sax.InputSource) @bci=76, line=261 (Compiled frame)
- org.apache.nutch.parse.html.HtmlParser.parse(org.xml.sax.InputSource) @bci=20, line=225 (Compiled frame)
- org.apache.nutch.parse.ParseUtil.parse(org.apache.nutch.protocol.Content) @bci=145, line=82 (Compiled frame)
- org.apache.nutch.parse.ParseSegment.map(org.apache.hadoop.io.WritableComparable, org.apache.hadoop.io.Writable, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=22, line=66 (Compiled frame)
- org.apache.hadoop.mapred.MapRunner.run(org.apache.hadoop.mapred.RecordReader, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=55, line=48 (Compiled frame)
- org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=198, line=129 (Interpreted frame)
- org.apache.hadoop.mapred.LocalJobRunner$Job.run() @bci=120, line=91 (Interpreted frame)
Thread 27435: (state = BLOCKED)
Thread 27434: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Compiled frame)
- java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Compiled frame)
Thread 27433: (state = BLOCKED)
- java.lang.Object.wait(long) @bci=0 (Interpreted frame)
- java.lang.Object.wait() @bci=2, line=474 (Compiled frame)
Thread 27428: (state = BLOCKED)
- java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
- org.apache.hadoop.mapred.JobClient.runJob(org.apache.hadoop.mapred.JobConf) @bci=67, line=332 (Interpreted frame)
- org.apache.nutch.parse.ParseSegment.parse(org.apache.hadoop.fs.Path) @bci=303, line=120 (Interpreted frame)
- org.apache.nutch.parse.ParseSegment.main(java.lang.String[]) @bci=43, line=138 (Interpreted frame)
(2) kill -SIGQUIT
Full thread dump Java HotSpot(TM) Client VM (1.5.0_05-b05 mixed mode):
"Thread-0" prio=1 tid=0x08518da0 nid=0x6b2f waiting on condition [0xababa000..0xababb680]
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
at java.lang.StringBuffer.append(StringBuffer.java:225)
- locked <0x45a086c8> (a java.lang.StringBuffer)
at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:1972)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1775)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:261)
at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:225)
at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:164)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:66)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:91)
"Low Memory Detector" daemon prio=1 tid=0x080c6d20 nid=0x6b2d runnable [0x00000000..0x00000000]
"CompilerThread0" daemon prio=1 tid=0x080c57d0 nid=0x6b2c waiting on condition [0x00000000..0xa99d41e8]
"Signal Dispatcher" daemon prio=1 tid=0x080c4938 nid=0x6b2b waiting on condition [0x00000000..0x00000000]
"Finalizer" daemon prio=1 tid=0x080b9528 nid=0x6b2a in Object.wait() [0xa9891000..0xa9891500]
at java.lang.Object.wait(Native Method)
- waiting on <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
- locked <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=1 tid=0x080b8860 nid=0x6b29 in Object.wait() [0xa9810000..0xa9810580]
at java.lang.Object.wait(Native Method)
- waiting on <0x4cc93428> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:474)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x4cc93428> (a java.lang.ref.Reference$Lock)
"main" prio=1 tid=0x0805cba0 nid=0x6b24 waiting on condition [0xbfffc000..0xbfffcb58]
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:332)
at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:120)
at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:138)
"VM Thread" prio=1 tid=0x080b5c40 nid=0x6b28 runnable
"VM Periodic Task Thread" prio=1 tid=0x080c81b0 nid=0x6b2e waiting on condition
> CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs
> ---------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-424
> URL: http://issues.apache.org/jira/browse/NUTCH-424
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Environment: Linux and Windows
> Reporter: Karsten Dello
>
> I've tracked down occasional fetcher hangs to NekoHTML's DOMFragmentParser hanging on certain HTML documents, for example, http://www.inlandrevenue.gov.uk/charities/chapter_3.htm.
> The thread dump on the hung parser is:
> "CompilerThread0" daemon prio=1 tid=0x080c4c18 nid=0x47da waiting on condition [0x00000000..0x8a3daf68]
> "Signal Dispatcher" daemon prio=1 tid=0x080c3d60 nid=0x47d9 waiting on condition [0x00000000..0x00000000]
> "Finalizer" daemon prio=1 tid=0x080b8818 nid=0x47d8 in Object.wait() [0x8a2a0000..0x8a2a0680]
> at java.lang.Object.wait(Native Method)
> - waiting on <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
> - locked <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
> at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
> at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> "Reference Handler" daemon prio=1 tid=0x080b7b50 nid=0x47d7 in Object.wait() [0x8a21f000..0x8a21f800]
> at java.lang.Object.wait(Native Method)
> - waiting on <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
> at java.lang.Object.wait(Object.java:474)
> at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
> - locked <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
> "main" prio=1 tid=0x0805c170 nid=0x47d1 waiting on condition [0xbfffc000..0xbfffcec8]
> at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
> at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
> at java.lang.StringBuffer.append(StringBuffer.java:225)
> - locked <0x45910118> (a java.lang.StringBuffer)
> at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
> at org.cyberneko.html.parsers.DOMFragmentParser.characters(Unknown Source)
> at org.cyberneko.html.filters.DefaultFilter.characters(Unknown Source)
> at org.cyberneko.html.HTMLTagBalancer.characters(Unknown Source)
> at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(Unknown Source)
> at org.cyberneko.html.HTMLScanner$ContentScanner.scan(Unknown Source)
> at org.cyberneko.html.HTMLScanner.scanDocument(Unknown Source)
> at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
> at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
> at org.cyberneko.html.parsers.DOMFragmentParser.parse(Unknown Source)
> at net.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:157)
> at net.nutch.parse.ParserChecker.main(ParserChecker.java:74)
> "VM Thread" prio=1 tid=0x080b4f30 nid=0x47d6 runnable
> "VM Periodic Task Thread" prio=1 tid=0x080c75f8 nid=0x47dc waiting on condition
> Using the URL mentioned above, I was able to successfully parse the file using a normal NekoHTML DocumentParser.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira