Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/04/12 23:58:16 UTC

[jira] [Commented] (NUTCH-1557) File extraction and classification for any MIME types from segments

    [ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630674#comment-13630674 ] 

Lewis John McGibbney commented on NUTCH-1557:
---------------------------------------------

Hi Chao,
Do you have any patch proposal for this?
What is your requirement behind this issue?
                
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>         Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software: Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
>            Reporter: Chao Yan
>            Priority: Minor
>
> The basic idea is to implement a file dumper as a plugin to extract files from Nutch SequenceFiles. The file dumper should detect each file's content type and dump the files into different directories based on that type. Each extracted file will be renamed based on information from its URL, metadata, and even content. File names should be globally unique and carry the correct file extension. The file dumper should also allow users to specify the formats of the files they want, and it should be extensible to arbitrary criteria on the extracted files. A more advanced goal is to implement it with MapReduce.
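
For discussion, the dump step described above could look roughly like the sketch below. It is not a patch and uses no Nutch API: reading records out of the segment SequenceFiles is left out, and every name here (FileDumperSketch, EXTENSIONS, dump, the "bin" fallback) is hypothetical. It assumes each record is already available as a URL, a content type string, and the raw bytes, and it derives the file name from a SHA-256 hash of the URL, which is only one possible way to get names that are unique per URL.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class FileDumperSketch {

    // Hypothetical MIME-type-to-extension table; a real dumper would need a much larger one.
    static final Map<String, String> EXTENSIONS = Map.of(
        "text/html", "html",
        "application/pdf", "pdf",
        "image/png", "png",
        "image/jpeg", "jpg");

    // Drop parameters such as "; charset=UTF-8" and lower-case the type.
    static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Unknown types fall back to "bin" (an assumption made for this sketch).
    static String extensionFor(String contentType) {
        return EXTENSIONS.getOrDefault(normalize(contentType), "bin");
    }

    // One output directory per content type, e.g. "application_pdf".
    static String directoryFor(String contentType) {
        return normalize(contentType).replace('/', '_');
    }

    // File name = hex SHA-256 of the URL plus the extension for the type.
    static String uniqueName(String url, String contentType) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex + "." + extensionFor(contentType);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Writes one record under outDir/<type directory>/<unique name>.
    // Returns null when the type is not in the wanted set; an empty set means "dump everything".
    static Path dump(Path outDir, String url, String contentType, byte[] content,
                     Set<String> wantedTypes) throws IOException {
        if (!wantedTypes.isEmpty() && !wantedTypes.contains(normalize(contentType))) {
            return null;
        }
        Path dir = outDir.resolve(directoryFor(contentType));
        Files.createDirectories(dir);
        return Files.write(dir.resolve(uniqueName(url, contentType)), content);
    }
}
```

The wantedTypes set stands in for the "formats the user wants" option; the more general criteria mentioned in the description would probably need a predicate over the whole record rather than a set of types.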
