Posted to dev@nutch.apache.org by "Chao Yan (JIRA)" <ji...@apache.org> on 2013/04/13 03:23:13 UTC

[jira] [Comment Edited] (NUTCH-1557) File extraction and classification for any MIME types from segments

    [ https://issues.apache.org/jira/browse/NUTCH-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630852#comment-13630852 ] 

Chao Yan edited comment on NUTCH-1557 at 4/13/13 1:22 AM:
----------------------------------------------------------

Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a Nutch plugin that dumps files from SequenceFiles, but it is still not clear to me which extension point it should be mounted to.
The dumper requires a mimes.type file, which contains the mapping from MIME types to file extensions, and it also requires a third-party library.
                
      was (Author: aceyan):
    Hi Lewis,
I am still trying to build a usable patch. The segment dumper will serve as a Nutch plugin that dumps files from SequenceFiles, but it is still not clear to me which extension point it should be mounted to.
The dumper requires a mimes.type file which contains the mapping from mime types to file extensions and a third party library.
                  
> File extraction and classification for any MIME types from segments
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1557
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1557
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>         Environment: Hardware: Intel core i5 2.5 GHz, 4GB memory. Software: Linux (Ubuntu 12.04) 64-bit, with JVM 1.7.0_15
>            Reporter: Chao Yan
>            Priority: Minor
>         Attachments: FileDumper.java, readme.txt
>
>
> The basic idea is to implement a file dumper as a plugin to extract files from Nutch SequenceFiles. The file dumper should detect the content type of each file and dump the files into different directories based on content type. Each extracted file will be renamed based on information from the URL, the metadata, and even the content. File names should be globally unique and carry the correct file extension. The file dumper should also allow users to specify the formats of the files they want, and it can be extended to apply any criteria to the extracted files. A more advanced goal is to implement it with MapReduce.
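
The dumping step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached FileDumper.java: the class DumpSketch, its methods, and the small in-code extension table are hypothetical, the MIME type is assumed to be already known for each record (detection is not shown), and a SHA-256 hash of the URL stands in for the "globally unique" file name. Reading records from the segment SequenceFiles is left out entirely.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
final class DumpSketch {

    // Assumed MIME type -> extension table. The real dumper is said to
    // read this mapping from its MIME types file instead.
    static final Map<String, String> EXTENSIONS = Map.of(
            "application/pdf", "pdf",
            "image/jpeg", "jpg",
            "text/html", "html");

    // Hex-encoded SHA-256 of the URL: a stable name that distinct URLs
    // will not share in practice.
    static String uniqueName(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Writes content to outDir/<type directory>/<hash>.<ext>.
    // Returns the written path, or null when the record is skipped:
    // either the MIME type has no known extension, or the caller gave
    // a non-empty set of wanted types that does not include it.
    static Path dump(Path outDir, String url, String mimeType,
                     byte[] content, Set<String> wanted) throws IOException {
        String ext = EXTENSIONS.get(mimeType);
        if (ext == null) {
            return null;
        }
        if (!wanted.isEmpty() && !wanted.contains(mimeType)) {
            return null;
        }
        // One directory per content type, e.g. "application_pdf".
        Path typeDir = outDir.resolve(mimeType.replace('/', '_'));
        Files.createDirectories(typeDir);
        Path target = typeDir.resolve(uniqueName(url) + "." + ext);
        return Files.write(target, content);
    }
}
```

Calling DumpSketch.dump(outDir, url, "application/pdf", bytes, Set.of()) would place the file under outDir/application_pdf/ with a .pdf extension. An empty set of wanted types means "keep everything", which is one possible way to express the user-specified format filter.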

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira