Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/01/12 20:00:12 UTC
[jira] [Updated] (NUTCH-1281) tika parser not work properly with
unwanted file types that passed from filters in nutch
[ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lewis John McGibbney updated NUTCH-1281:
----------------------------------------
Fix Version/s: 2.2, 1.7
> tika parser not work properly with unwanted file types that passed from filters in nutch
> ----------------------------------------------------------------------------------------
>
> Key: NUTCH-1281
> URL: https://issues.apache.org/jira/browse/NUTCH-1281
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Reporter: behnam nikbakht
> Fix For: 1.7, 2.2
>
>
> When parse-plugins.xml contains this mapping:
> <mimeType name="*">
> <plugin id="parse-tika" />
> </mimeType>
> every unwanted file that gets past all the filters is referred to Tika.
> For some file types, such as .flv, the Tika parser hangs, which causes the parse job to fail.
> So whenever such file types get past regex-urlfilter and the other filters, the parse job fails.
> To fix this, I suggest adding a property that lists the valid file types and using code like this in TikaParser.java:
> public ParseResult getParse(Content content) {
> String mimeType = content.getContentType();
> + String[] validTypes = new String[]{"application/pdf","application/x-tika-msoffice","application/x-tika-ooxml","application/vnd.oasis.opendocument.text","text/plain","application/rtf","application/rss+xml","application/x-bzip2","application/x-gzip","application/x-javascript","application/javascript","text/javascript","application/x-shockwave-flash","application/zip","text/xml","application/xml"};
> + boolean valid=false;
> + for(int k=0;k<validTypes.length;k++){
> + if(validTypes[k].compareTo(mimeType.toLowerCase())==0)
> + valid=true;
> + }
> + if(!valid)
> + return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype "+ mimeType).getEmptyParseResult(content.getUrl(), getConf());
>
> URL base;
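
The description suggests a property for the valid file types, while the snippet hard-codes them. A minimal sketch of the configurable variant might look like the following. It is an illustration only, not the committed fix: the class name MimeTypeAllowList and the idea of a comma-separated property value (for example under a hypothetical key such as tika.valid.mimetypes) are assumptions and do not come from the issue. The helper has no Nutch dependencies so it can be run on its own.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): decides whether a MIME type may be handed to Tika. */
final class MimeTypeAllowList {

  private final Set<String> allowed;

  /**
   * @param commaSeparated comma-separated MIME types, e.g. the value of an
   *        assumed configuration property; null or empty allows nothing
   */
  MimeTypeAllowList(String commaSeparated) {
    Set<String> types = new HashSet<>();
    if (commaSeparated != null) {
      for (String type : commaSeparated.split(",")) {
        String normalized = normalize(type);
        if (!normalized.isEmpty()) {
          types.add(normalized);
        }
      }
    }
    this.allowed = Collections.unmodifiableSet(types);
  }

  /** Returns true only if the (normalized) MIME type is in the configured list. */
  boolean isAllowed(String mimeType) {
    if (mimeType == null) {
      return false; // unknown content type: do not pass it to the parser
    }
    return allowed.contains(normalize(mimeType));
  }

  /** Trims, lower-cases, and drops parameters such as "; charset=utf-8". */
  private static String normalize(String mimeType) {
    String s = mimeType.trim().toLowerCase(Locale.ROOT);
    int semicolon = s.indexOf(';');
    return semicolon >= 0 ? s.substring(0, semicolon).trim() : s;
  }
}
```

In getParse, the loop from the snippet above would then reduce to a single isAllowed(mimeType) check before returning the NOTPARSED status. The set lookup also handles a null content type, which the mimeType.toLowerCase() call in the snippet would not.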