Posted to dev@nutch.apache.org by "Alexis (JIRA)" <ji...@apache.org> on 2011/02/08 18:47:02 UTC

[jira] Updated: (NUTCH-965) Parsing takes up 100% CPU

     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexis updated NUTCH-965:
-------------------------

    Attachment: parserJob.patch

In the parser mapper, compare the Content-Length header to the size of the content buffer to check whether they match.

If this HTTP header is available and the content was truncated, skip the parsing step so that the parser does not get stuck in an infinite loop that consumes all CPU resources.
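
The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in parserJob.patch: the class name TruncationCheck and the method isTruncated are hypothetical, and the sketch assumes the header value is passed in as a plain string (or null when absent) alongside the number of bytes actually stored. It also assumes that only the case where the header declares more bytes than were stored should count as truncation.

```java
// Hypothetical helper for illustration only; not taken from parserJob.patch.
final class TruncationCheck {

    /**
     * Returns true only when a parseable Content-Length header is present
     * and declares more bytes than were actually stored in the buffer.
     */
    static boolean isTruncated(String contentLengthHeader, int actualSize) {
        if (contentLengthHeader == null) {
            return false; // header unavailable: nothing to compare against
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false; // malformed header: treat the declared size as unknown
        }
        return declared > actualSize;
    }

    public static void main(String[] args) {
        // Sizes taken from the first "After" log line in this message.
        System.out.println(isTruncated("4527822", 63980)); // prints "true"
        System.out.println(isTruncated("63980", 63980));   // prints "false"
        System.out.println(isTruncated(null, 63980));      // prints "false"
    }
}
```

A caller would skip the parse step and log the two sizes when the method returns true, and parse as usual otherwise. Returning false for a missing or malformed header keeps the existing behaviour for responses that do not report a usable length.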


Before, in the logs, we would see:

{noformat}2011-02-07 14:03:34,693 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb1.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:03:34,693 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb1.flv of type video/x-flv
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/dtj.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:04,725 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/dtj.flv of type video/x-flv
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - TIMEOUT parsing http://downtownjoes.com/botb2.flv with org.apache.nutch.parse.tika.TikaParser@8c0162
2011-02-07 14:04:34,772 WARN  parse.ParseUtil - Unable to successfully parse content http://downtownjoes.com/botb2.flv of type video/x-flv
{noformat} 

After:

{noformat}2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb1.flv skipped. Content of size 4527822 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/dtj.flv skipped. Content of size 2692082 was truncated to 63980
2011-02-08 09:06:54,482 INFO  parse.ParserJob - http://downtownjoes.com/botb2.flv skipped. Content of size 35496213 was truncated to 61058
{noformat} 




> Parsing takes up 100% CPU
> -------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>         Attachments: parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted data, for example when big binary files are truncated at fetch time.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira