Posted to dev@nutch.apache.org by "Lewis John McGibbney (Updated) (JIRA)" <ji...@apache.org> on 2012/01/11 20:59:39 UTC

[jira] [Updated] (NUTCH-965) Skip parsing for truncated documents

     [ https://issues.apache.org/jira/browse/NUTCH-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-965:
---------------------------------------

    Attachment: NUTCH-965-v2.patch

Hi Guys,

I would ask you to comment, as this patch is not finished yet. Although I've made the functionality configurable via a boolean, I have intentionally not yet addressed the second of your points, Julien, regarding FetcherJob.java.

I see that the boolean parsing value is set in this class [1], but I would like you to confirm whether the code I'm writing should live under the public Collection object on line 138.

Once this is addressed it would be great to get a patch for trunk.

Thanks to anyone who can comment on this.

[1] http://svn.apache.org/viewvc/nutch/branches/nutchgora/src/java/org/apache/nutch/fetcher/FetcherJob.java?view=markup
                
> Skip parsing for truncated documents
> ------------------------------------
>
>                 Key: NUTCH-965
>                 URL: https://issues.apache.org/jira/browse/NUTCH-965
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Alexis
>            Assignee: Lewis John McGibbney
>             Fix For: nutchgora, 1.5
>
>         Attachments: NUTCH-965-v2.patch, parserJob.patch
>
>
> The issue you're likely to run into when parsing truncated FLV files is described here:
> http://www.mail-archive.com/user@nutch.apache.org/msg01880.html
> The parser library gets stuck in an infinite loop when it encounters corrupted data caused by, for example, truncating big binary files at fetch time.
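
To make the idea in the description concrete, here is a minimal sketch of how a "skip parsing when truncated" check might look. It is not taken from NUTCH-965-v2.patch: the class name, the method names, and the choice to detect truncation by comparing the fetched byte count with the Content-Length header value are all assumptions made for illustration. The skipTruncated parameter stands in for the boolean configuration option mentioned above.

```java
// Illustrative only: class and method names are hypothetical, not from the patch.
final class TruncationCheck {

    private TruncationCheck() {
    }

    // Assumption: a document counts as truncated when fewer bytes were stored
    // than the server declared in its Content-Length header.
    static boolean isTruncated(byte[] content, String contentLengthHeader) {
        if (content == null || contentLengthHeader == null) {
            return false; // nothing to compare against, so do not flag it
        }
        long declared;
        try {
            declared = Long.parseLong(contentLengthHeader.trim());
        } catch (NumberFormatException e) {
            return false; // unusable header value, so do not flag it
        }
        return content.length < declared;
    }

    // skipTruncated stands in for the boolean configuration option.
    // Returns false only when skipping is enabled and the content looks truncated.
    static boolean shouldParse(boolean skipTruncated, byte[] content,
                               String contentLengthHeader) {
        return !(skipTruncated && isTruncated(content, contentLengthHeader));
    }
}
```

With skipping enabled, a 16-byte body fetched for a response that declared 65536 bytes would not be handed to the parser. With skipping disabled, or when the header is missing or malformed, the document is parsed as before.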
