Posted to dev@nutch.apache.org by "Ferdy Galema (JIRA)" <ji...@apache.org> on 2012/06/12 13:12:43 UTC

[jira] [Created] (NUTCH-1387) All parsers should respond to cancellation.

Ferdy Galema created NUTCH-1387:
-----------------------------------

             Summary: All parsers should respond to cancellation.
                 Key: NUTCH-1387
                 URL: https://issues.apache.org/jira/browse/NUTCH-1387
             Project: Nutch
          Issue Type: Bug
            Reporter: Ferdy Galema


During parsing a TimeoutException can occur. This happens whenever FutureTask.get() cannot complete within the specified timeout. The tricky part is that a single URL might parse well within the timeout on its own, but under heavy concurrent load (a lot of semi-expensive parses) the parser load can stack up and cause many parses to time out. This can be the case with parsing during fetch. But it can also happen when using a separate parserjob, because Parser implementations do not necessarily respond to a thread interrupt (which is fired by the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT <process_id>. This dumps stack traces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course, many of them already timed out a long time ago.)
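
To make the failure mode concrete, here is a minimal sketch, not Nutch code. The class ParseTimeoutDemo, the stubbornParse helper and all timing values are hypothetical; only FutureTask.get(timeout), TimeoutException and cancel(true) come from the description above. It shows that a task which never reads its interrupt flag keeps running after it has timed out and been cancelled.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ParseTimeoutDemo {

    /** Hypothetical stand-in for a parser that never checks the interrupt flag. */
    static Callable<String> stubbornParse(AtomicBoolean stillRunning, long busyMillis) {
        return () -> {
            stillRunning.set(true);
            long end = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(busyMillis);
            while (System.nanoTime() < end) {
                // Busy work. cancel(true) sets the interrupt flag, but nothing reads it.
            }
            stillRunning.set(false);
            return "parsed";
        };
    }

    /** Returns true if the task was still running shortly after timeout and cancel(true). */
    static boolean keepsRunningAfterCancel() throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicBoolean stillRunning = new AtomicBoolean(false);
        try {
            FutureTask<String> task = new FutureTask<>(stubbornParse(stillRunning, 500));
            pool.execute(task);
            try {
                task.get(50, TimeUnit.MILLISECONDS);
                return false; // finished within the timeout
            } catch (TimeoutException e) {
                task.cancel(true); // interrupts the worker thread
            }
            Thread.sleep(50); // give a cooperative task time to stop
            return stillRunning.get();
        } finally {
            pool.shutdownNow();
            pool.awaitTermination(2, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("still running after cancel: " + keepsRunningAfterCancel());
    }
}
```

From the caller's point of view the parse is over once get() throws, yet the worker thread stays busy until the loop ends by itself. With many such parses in flight, these leftover threads are the ones that show up in a kill -QUIT dump.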

To fix this, every parser should check its interrupted state at regular intervals. (For example, an HTML parse might be stuck walking the DOM tree, so checking after every Nth element would be an appropriate moment.)
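
A minimal sketch of such a periodic check, assuming a plain org.w3c.dom tree. The class InterruptibleDomWalk, the CHECK_EVERY interval and the buildDoc helper are hypothetical names for illustration, not part of any Nutch parser. The walk polls Thread.interrupted() after every Nth node and aborts with an InterruptedException.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class InterruptibleDomWalk {

    /** Hypothetical interval: poll the interrupt flag once per this many nodes. */
    static final int CHECK_EVERY = 100;

    /** Visits every node under root; throws InterruptedException if the thread was interrupted. */
    static int countNodes(Node root) throws InterruptedException {
        ArrayDeque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        int visited = 0;
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            visited++;
            // Thread.interrupted() also clears the flag, so signal cancellation by throwing.
            if (visited % CHECK_EVERY == 0 && Thread.interrupted()) {
                throw new InterruptedException("parse cancelled after " + visited + " nodes");
            }
            for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
                stack.push(child);
            }
        }
        return visited;
    }

    /** Builds a flat test document: one root element with the given number of empty children. */
    static Document buildDoc(int elements) throws Exception {
        StringBuilder xml = new StringBuilder("<root>");
        for (int i = 0; i < elements; i++) {
            xml.append("<p/>");
        }
        xml.append("</root>");
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.toString().getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildDoc(1000);
        System.out.println("nodes: " + countNodes(doc.getDocumentElement()));

        Thread.currentThread().interrupt(); // simulate what task.cancel(true) does to the worker
        try {
            countNodes(doc.getDocumentElement());
        } catch (InterruptedException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Polling every Nth node rather than every node keeps the overhead small, while still bounding how long a cancelled parse can keep running.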

This issue is for reference first. Fixing it all at once would be a huge task.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Updated] (NUTCH-1387) All parsers should respond to cancellation / interrupts.

Posted by "Ferdy Galema (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferdy Galema updated NUTCH-1387:
--------------------------------

    Component/s: parser
        Summary: All parsers should respond to cancellation / interrupts.  (was: All parsers should respond to cancellation.)
    
> All parsers should respond to cancellation / interrupts.
> --------------------------------------------------------
>
>                 Key: NUTCH-1387
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1387
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>            Reporter: Ferdy Galema
>
> During parsing a TimeoutException can occur. This happens whenever FutureTask.get() cannot complete within the specified timeout. The tricky part is that a single URL might parse well within the timeout on its own, but under heavy concurrent load (a lot of semi-expensive parses) the parser load can stack up and cause many parses to time out. This can be the case with parsing during fetch. But it can also happen when using a separate parserjob, because Parser implementations do not necessarily respond to a thread interrupt (which is fired by the task.cancel(true) call). If a parser does not check the Thread.interrupted state at regular intervals, it will just continue to run and eat up resources. I find it very helpful to debug stalling fetchers/parsers with the lazy man's profiler: kill -QUIT <process_id>. This dumps stack traces, sometimes exposing the fact that hundreds of parser threads are still active in the background. (Of course, many of them already timed out a long time ago.)
> To fix this, every parser should check its interrupted state at regular intervals. (For example, an HTML parse might be stuck walking the DOM tree, so checking after every Nth element would be an appropriate moment.)
> This issue is for reference first. Fixing it all at once would be a huge task.
