Posted to dev@nutch.apache.org by "Karsten Dello (JIRA)" <ji...@apache.org> on 2007/01/01 22:27:27 UTC

[jira] Created: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs

CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs
---------------------------------------------------------------------------------------------------------------

                 Key: NUTCH-424
                 URL: http://issues.apache.org/jira/browse/NUTCH-424
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
         Environment: Linux and Windows
            Reporter: Karsten Dello


I've tracked down occasional fetcher hangs to NekoHTML's DOMFragmentParser hanging on certain HTML documents, for example, http://www.inlandrevenue.gov.uk/charities/chapter_3.htm.

The thread dump on the hung parser is:
"CompilerThread0" daemon prio=1 tid=0x080c4c18 nid=0x47da waiting on condition [0x00000000..0x8a3daf68]

"Signal Dispatcher" daemon prio=1 tid=0x080c3d60 nid=0x47d9 waiting on condition [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x080b8818 nid=0x47d8 in Object.wait() [0x8a2a0000..0x8a2a0680]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x4a60d058> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x080b7b50 nid=0x47d7 in Object.wait() [0x8a21f000..0x8a21f800]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4a60d0d8> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:474)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x4a60d0d8> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805c170 nid=0x47d1 waiting on condition [0xbfffc000..0xbfffcec8]
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
        at java.lang.StringBuffer.append(StringBuffer.java:225)
        - locked <0x45910118> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.cyberneko.html.parsers.DOMFragmentParser.characters(Unknown Source)
        at org.cyberneko.html.filters.DefaultFilter.characters(Unknown Source)
        at org.cyberneko.html.HTMLTagBalancer.characters(Unknown Source)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(Unknown Source)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(Unknown Source)
        at org.cyberneko.html.HTMLScanner.scanDocument(Unknown Source)
        at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
        at org.cyberneko.html.HTMLConfiguration.parse(Unknown Source)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(Unknown Source)
        at net.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:157)
        at net.nutch.parse.ParserChecker.main(ParserChecker.java:74)

"VM Thread" prio=1 tid=0x080b4f30 nid=0x47d6 runnable

"VM Periodic Task Thread" prio=1 tid=0x080c75f8 nid=0x47dc waiting on condition

Using the URL mentioned above, I was able to successfully parse the file using a normal NekoHTML DocumentParser.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))

Posted by "Mike Brzozowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Brzozowski updated NUTCH-424:
----------------------------------

    Summary: NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))  (was: CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs)


[jira] Updated: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs

Posted by "Mike Brzozowski (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Brzozowski updated NUTCH-424:
----------------------------------

    Affects Version/s: 0.8.1
                       0.9.0


[jira] Commented: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs

Posted by "Mike Brzozowski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494777 ] 

Mike Brzozowski commented on NUTCH-424:
---------------------------------------

This problem appears to persist in nutch-0.9. Is there a workaround?
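
One generic workaround for a parser that never returns is to run each parse on a worker thread and give up after a deadline. The following is only a minimal sketch under stated assumptions: TimedParse and callWithTimeout are hypothetical names, not part of Nutch or NekoHTML, and the Callable stands in for whatever call actually invokes the HTML parser.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: bounds the time a single parse may take.
class TimedParse {

    /** Runs task on a daemon worker thread; returns fallback if it fails or exceeds timeoutMillis. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck parser must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; a CPU-bound parser may ignore it
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the task itself threw an exception
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        String fast = callWithTimeout(() -> "parsed", 1000, "skipped");
        String slow = callWithTimeout(() -> { Thread.sleep(60_000); return "parsed"; }, 100, "skipped");
        System.out.println(fast + " / " + slow);
    }
}
```

Note the limitation: cancelling only interrupts the worker. A parser spinning in pure computation will not notice the interrupt, so the abandoned daemon thread keeps burning CPU until the process exits. The caller is unblocked, but the underlying hang is not fixed.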


[jira] Commented: (NUTCH-424) CLONE - Problem persists with Nutch 0.8.1 (Nekohtml 0.9.4) - NekoHTML's DOMFragmentParser hangs on certain URLs

Posted by "Karsten Dello (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12461647 ] 

Karsten Dello commented on NUTCH-424:
-------------------------------------

Sorry for cloning, but I could not reopen the original issue.

The problem persists with the current stable Nutch version (0.8.1), which uses NekoHTML 0.9.4.

I parse after fetching ("nutch parse"), so I cannot see the URL in the log file,
even though I have set the log level to DEBUG.
Is there a way to log it?
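
One way to find out which URL a stuck thread is working on, without relying on log output, is to put the URL into the thread name for the duration of the parse. Thread names are printed in quotes at the top of each entry in a SIGQUIT-style dump. This is a sketch only: UrlTaggedThread and withUrlInThreadName are hypothetical names, not a Nutch API, and the Supplier stands in for the real parse call.

```java
import java.util.function.Supplier;

// Hypothetical helper, not a Nutch API: makes the current URL visible in thread dumps.
class UrlTaggedThread {

    /** Runs work with the URL appended to the thread name, restoring the old name afterwards. */
    static <T> T withUrlInThreadName(String url, Supplier<T> work) {
        Thread current = Thread.currentThread();
        String originalName = current.getName();
        current.setName(originalName + " [parsing " + url + "]");
        try {
            return work.get();
        } finally {
            current.setName(originalName); // always restore, even if the parse throws
        }
    }

    public static void main(String[] args) {
        String seen = withUrlInThreadName("http://example.com/page.html",
                () -> Thread.currentThread().getName());
        System.out.println(seen);
        System.out.println(Thread.currentThread().getName());
    }
}
```

If the parse hangs, the name is never restored, which is the point: the next thread dump shows the offending URL next to the stuck stack.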

Anyway, it seems to be exactly the same error.
Here is the output from jstack and "kill -SIGQUIT <pid>":

(1) jstack output

Attaching to process ID 27428, please wait...
Debugger attached successfully.
Client compiler detected.
JVM version is 1.5.0_05-b05
Thread 27439: (state = BLOCKED)
 - java.lang.AbstractStringBuilder.expandCapacity(int) @bci=28, line=99 (Compiled frame)
 - java.lang.AbstractStringBuilder.append(java.lang.String) @bci=36, line=393 (Compiled frame)
 - java.lang.StringBuffer.append(java.lang.String) @bci=2, line=225 (Compiled frame)
 - org.apache.xerces.dom.CharacterDataImpl.appendData(java.lang.String) @bci=59 (Compiled frame)
 - org.cyberneko.html.parsers.DOMFragmentParser.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=117, line=463 (Compiled frame)
 - org.cyberneko.html.filters.DefaultFilter.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=13, line=195 (Compiled frame)
 - org.cyberneko.html.HTMLTagBalancer.characters(org.apache.xerces.xni.XMLString, org.apache.xerces.xni.Augmentations) @bci=294, line=821 (Compiled frame)
 - org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters() @bci=324, line=1972 (Compiled frame)
 - org.cyberneko.html.HTMLScanner$ContentScanner.scan(boolean) @bci=184, line=1775 (Compiled frame)
 - org.cyberneko.html.HTMLScanner.scanDocument(boolean) @bci=5, line=789 (Compiled frame)
 - org.cyberneko.html.HTMLConfiguration.parse(org.apache.xerces.xni.parser.XMLInputSource) @bci=7, line=431 (Compiled frame)
 - org.cyberneko.html.parsers.DOMFragmentParser.parse(org.xml.sax.InputSource, org.w3c.dom.DocumentFragment) @bci=93, line=164 (Compiled frame)
 - org.apache.nutch.parse.html.HtmlParser.parseNeko(org.xml.sax.InputSource) @bci=76, line=261 (Compiled frame)
 - org.apache.nutch.parse.html.HtmlParser.parse(org.xml.sax.InputSource) @bci=20, line=225 (Compiled frame)
 - org.apache.nutch.parse.ParseUtil.parse(org.apache.nutch.protocol.Content) @bci=145, line=82 (Compiled frame)
 - org.apache.nutch.parse.ParseSegment.map(org.apache.hadoop.io.WritableComparable, org.apache.hadoop.io.Writable, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=22, line=66 (Compiled frame)
 - org.apache.hadoop.mapred.MapRunner.run(org.apache.hadoop.mapred.RecordReader, org.apache.hadoop.mapred.OutputCollector, org.apache.hadoop.mapred.Reporter) @bci=55, line=48 (Compiled frame)
 - org.apache.hadoop.mapred.MapTask.run(org.apache.hadoop.mapred.JobConf, org.apache.hadoop.mapred.TaskUmbilicalProtocol) @bci=198, line=129 (Interpreted frame)
 - org.apache.hadoop.mapred.LocalJobRunner$Job.run() @bci=120, line=91 (Interpreted frame)


Thread 27435: (state = BLOCKED)


Thread 27434: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=116 (Compiled frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=132 (Compiled frame)


Thread 27433: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.Object.wait() @bci=2, line=474 (Compiled frame)


Thread 27428: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.mapred.JobClient.runJob(org.apache.hadoop.mapred.JobConf) @bci=67, line=332 (Interpreted frame)
 - org.apache.nutch.parse.ParseSegment.parse(org.apache.hadoop.fs.Path) @bci=303, line=120 (Interpreted frame)
 - org.apache.nutch.parse.ParseSegment.main(java.lang.String[]) @bci=43, line=138 (Interpreted frame)


(2) kill -SIGQUIT 



Full thread dump Java HotSpot(TM) Client VM (1.5.0_05-b05 mixed mode):

"Thread-0" prio=1 tid=0x08518da0 nid=0x6b2f waiting on condition [0xababa000..0xababb680]
        at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:99)
        at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:393)
        at java.lang.StringBuffer.append(StringBuffer.java:225)
        - locked <0x45a086c8> (a java.lang.StringBuffer)
        at org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
        at org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)
        at org.cyberneko.html.filters.DefaultFilter.characters(DefaultFilter.java:195)
        at org.cyberneko.html.HTMLTagBalancer.characters(HTMLTagBalancer.java:821)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanCharacters(HTMLScanner.java:1972)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1775)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:261)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:225)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:164)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:66)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:91)

"Low Memory Detector" daemon prio=1 tid=0x080c6d20 nid=0x6b2d runnable [0x00000000..0x00000000]

"CompilerThread0" daemon prio=1 tid=0x080c57d0 nid=0x6b2c waiting on condition [0x00000000..0xa99d41e8]

"Signal Dispatcher" daemon prio=1 tid=0x080c4938 nid=0x6b2b waiting on condition [0x00000000..0x00000000]

"Finalizer" daemon prio=1 tid=0x080b9528 nid=0x6b2a in Object.wait() [0xa9891000..0xa9891500]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x4cc933a8> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=1 tid=0x080b8860 nid=0x6b29 in Object.wait() [0xa9810000..0xa9810580]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x4cc93428> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:474)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x4cc93428> (a java.lang.ref.Reference$Lock)

"main" prio=1 tid=0x0805cba0 nid=0x6b24 waiting on condition [0xbfffc000..0xbfffcb58]
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:332)
        at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:120)
        at org.apache.nutch.parse.ParseSegment.main(ParseSegment.java:138)

"VM Thread" prio=1 tid=0x080b5c40 nid=0x6b28 runnable 

"VM Periodic Task Thread" prio=1 tid=0x080c81b0 nid=0x6b2e waiting on condition 

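A comparable dump can also be produced from inside the JVM, which may help when attaching jstack or sending a signal is inconvenient. This is a rough sketch using only the standard Thread.getAllStackTraces() call. ThreadDump is a hypothetical class name, and the output format only loosely imitates the dumps above (no lock or native thread-id details).

```java
import java.util.Map;

// Hypothetical helper: renders a simple thread dump of the running JVM as text.
class ThreadDump {

    /** Returns the name, daemon flag, state and stack frames of every live thread. */
    static String dump() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            out.append('"').append(t.getName()).append('"')
               .append(t.isDaemon() ? " daemon" : "")
               .append(" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("        at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```
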


        

[jira] Commented: (NUTCH-424) NekoHTML's DOMFragmentParser hangs on certain URLs (CLONE: Problem persists with Nutch 0.9 and 0.8.1 (Nekohtml 0.9.4))

Posted by "Mike Brzozowski (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494782 ] 

Mike Brzozowski commented on NUTCH-424:
---------------------------------------

It looks like multiple fetcher threads block on each other. For instance, here's one such thread:

Name: FetcherThread
State: BLOCKED on java.lang.ref.Reference$Lock@bafb71 owned by: FetcherThread
Total blocked: 48,597  Total waited: 1,444

Stack trace: 
java.util.Arrays.copyOf(Arrays.java:2882)
java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
java.lang.StringBuffer.append(StringBuffer.java:224)
   - locked java.lang.StringBuffer@1272f63
org.apache.xerces.dom.CharacterDataImpl.appendData(Unknown Source)
org.cyberneko.html.parsers.DOMFragmentParser.characters(DOMFragmentParser.java:463)

Right now I have 30 threads that are all inside appendData(). Is this function thread-safe?

It looks like the released build was not changed to use DOMParser after all... 
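
On the thread-safety question: StringBuffer.append is synchronized on the buffer itself, so the "locked java.lang.StringBuffer" line in the trace means the thread holds that buffer's monitor, not that it is waiting for it. One plausible reading of the traces (an assumption, not confirmed anywhere in this issue) is that every characters() callback appends to an ever-growing text node and each append copies all the text accumulated so far. On a page with a very long run of text that cost is quadratic, so the threads would be extremely slow rather than deadlocked. The sketch below illustrates only that cost pattern. AppendCost and both methods are hypothetical stand-ins and do not use Xerces or NekoHTML.

```java
import java.util.Arrays;

// Hypothetical stand-in for a text node; illustrates the cost pattern only.
class AppendCost {

    /** Rebuilds the whole content on every append: O(n^2) characters copied overall. */
    static String appendByCopying(String[] chunks) {
        String data = "";
        for (String chunk : chunks) {
            // Copies everything accumulated so far, then copies it again in toString().
            data = new StringBuffer(data).append(chunk).toString();
        }
        return data;
    }

    /** Accumulates into a single buffer and converts once: O(n) characters copied overall. */
    static String appendOnce(String[] chunks) {
        StringBuilder buffer = new StringBuilder();
        for (String chunk : chunks) {
            buffer.append(chunk);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String[] chunks = new String[5_000];
        Arrays.fill(chunks, "0123456789");
        long t0 = System.nanoTime();
        String slow = appendByCopying(chunks);
        long t1 = System.nanoTime();
        String fast = appendOnce(chunks);
        long t2 = System.nanoTime();
        System.out.println("same result: " + slow.equals(fast));
        System.out.println("copy per append, ms: " + (t1 - t0) / 1_000_000);
        System.out.println("single buffer, ms:   " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same string; only the amount of copying differs. The gap widens quickly as the number of chunks grows, which would be consistent with hangs that show up only on certain documents.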