Posted to dev@nutch.apache.org by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org> on 2012/02/19 12:38:34 UTC

[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

    [ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314 ] 

Lewis John McGibbney commented on NUTCH-1281:
---------------------------------------------

Hi behnam, there is a similar open issue, NUTCH-965, for which a patch has been submitted for Nutchgora. Could you check it out and comment on how the two issues relate?

Also, would it be possible for you to attach your code changes as a patch against trunk, which I guess is what you are using? Thank you.
                
> tika parser not work properly with unwanted file types that passed from filters in nutch
> ----------------------------------------------------------------------------------------
>
>                 Key: NUTCH-1281
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>            Reporter: behnam nikbakht
>
> When parse-plugins.xml contains this mapping:
> <mimeType name="*">
>         <plugin id="parse-tika" />
> </mimeType>
> all unwanted files that get past the filters are referred to Tika.
> For some file types, such as .flv, the Tika parser hangs, which causes the parse job to fail.
> So if files of these types get past regex-urlfilter and the other filters, the parse job fails.
> To address this, I suggest adding a property that lists the valid file types and checking it in TikaParser.java, like this:
> public ParseResult getParse(Content content) {
> 		String mimeType = content.getContentType();
> +		// MIME types that parse-tika is allowed to handle; anything else is skipped.
> +		String[] validTypes = new String[] {
> +			"application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
> +			"application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
> +			"application/rss+xml", "application/x-bzip2", "application/x-gzip",
> +			"application/x-javascript", "application/javascript", "text/javascript",
> +			"application/x-shockwave-flash", "application/zip", "text/xml", "application/xml" };
> +		boolean valid = false;
> +		for (String validType : validTypes) {
> +			// equalsIgnoreCase(null) is false, so a missing content type is rejected too
> +			if (validType.equalsIgnoreCase(mimeType)) { valid = true; break; }
> +		}
> +		if (!valid)
> +			return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + mimeType).getEmptyParseResult(content.getUrl(), getConf());
> 	
> 		URL base;
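
The suggestion in the description, moving the valid types out of the code and into a property, could be sketched roughly as below. This is a standalone illustration rather than Nutch code: the class name MimeTypeAllowList is hypothetical, and so is the idea of feeding it a comma-separated value read from a property such as parser.tika.valid.mimetypes (a name invented here, not an existing Nutch setting). Parameters after a semicolon, such as a charset, are dropped before matching.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: an allow-list of MIME types built from a
// comma-separated string (for example, the value of a config property).
final class MimeTypeAllowList {
    private final Set<String> allowed;

    MimeTypeAllowList(String commaSeparatedTypes) {
        Set<String> types = new HashSet<>();
        if (commaSeparatedTypes != null) {
            for (String type : commaSeparatedTypes.split(",")) {
                String normalized = normalize(type);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.allowed = Collections.unmodifiableSet(types);
    }

    // Lower-cases the type and strips parameters such as "; charset=UTF-8".
    // A null input becomes the empty string, which is never in the set.
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // True only if the type was listed; null or unknown types are rejected.
    boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }
}
```

With such a helper, the check in getParse would reduce to a single isAllowed(mimeType) call before the NOTPARSED return, and the list of types could be changed without recompiling the plugin.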

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira