Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2013/01/12 20:00:12 UTC

[jira] [Updated] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

     [ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1281:
----------------------------------------

    Fix Version/s: 2.2
                   1.7
    
> tika parser not work properly with unwanted file types that passed from filters in nutch
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: behnam nikbakht
>             Fix For: 1.7, 2.2
>
>
> When parse-plugins.xml contains this catch-all mapping:
> <mimeType name="*">
>         <plugin id="parse-tika" />
> </mimeType>
> every file that gets past all the URL filters, including unwanted file types, is handed to the Tika parser.
> For some file types, such as .flv, the Tika parser hangs, which causes the parse job to fail.
> So whenever files of these types get past regex-urlfilter and the other filters, the parse job fails.
> To address this, I suggest adding a property that lists the valid file types and checking it in TikaParser.java, like this:
> public ParseResult getParse(Content content) {
> 		String mimeType = content.getContentType();
> +		// Only these MIME types are handed to Tika; anything else is returned as NOTPARSED.
> +		String[] validTypes = new String[] {
> +			"application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml", "application/vnd.oasis.opendocument.text",
> +			"text/plain", "application/rtf", "application/rss+xml", "application/x-bzip2",
> +			"application/x-gzip", "application/x-javascript", "application/javascript", "text/javascript",
> +			"application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
> +		boolean valid = false;
> +		for (int k = 0; k < validTypes.length; k++) {
> +			if (validTypes[k].equalsIgnoreCase(mimeType)) { // also false when mimeType is null
> +				valid = true;
> +				break;
> +			}
> +		}
> +		if (!valid)
> +			return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + mimeType).getEmptyParseResult(content.getUrl(), getConf());
> 	
> 		URL base;
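
The description above suggests reading the valid file types from a property rather than hard-coding them. A minimal sketch of that idea follows. It is not code from the issue or from Nutch: the class MimeTypeAllowList and the property name mentioned below are hypothetical. It assumes the allowed types arrive as one comma-separated string and that a content type may carry parameters such as "; charset=UTF-8", which are ignored when matching.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): a configurable MIME type allow-list. */
final class MimeTypeAllowList {

    private final Set<String> allowed;

    /** Builds the list from a comma-separated value, e.g. one read from a config property. */
    MimeTypeAllowList(String commaSeparatedTypes) {
        Set<String> types = new HashSet<>();
        if (commaSeparatedTypes != null) {
            for (String type : commaSeparatedTypes.split(",")) {
                String normalized = normalize(type);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.allowed = Collections.unmodifiableSet(types);
    }

    /** Trims, lower-cases, and drops any parameters after ';'. Null becomes "". */
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    /** True only if the base MIME type is on the list; null and unknown types are rejected. */
    boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }
}
```

In getParse the hard-coded array and loop could then shrink to a single isAllowed(content.getContentType()) check, with the list built once from a property such as "parser.tika.valid.mimetypes" (a made-up name used here only for illustration). A set lookup also avoids scanning the whole array for every document.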
