
Posted to dev@nutch.apache.org by "behnam nikbakht (Created) (JIRA)" <ji...@apache.org> on 2012/02/19 06:43:59 UTC

[jira] [Created] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

tika parser not work properly with unwanted file types that passed from filters in nutch
----------------------------------------------------------------------------------------

                 Key: NUTCH-1281
                 URL: https://issues.apache.org/jira/browse/NUTCH-1281
             Project: Nutch
          Issue Type: Improvement
          Components: parser
            Reporter: behnam nikbakht


When parse-plugins.xml contains this mapping:
<mimeType name="*">
        <plugin id="parse-tika" />
</mimeType>
every unwanted file that gets past the URL filters is handed to Tika.
For some file types, such as .flv, the Tika parser hangs, which causes the parse job to fail.
So whenever such files get past regex-urlfilter and the other filters, the parse job fails.
To address this, I suggest adding a property that lists the valid file types and checking it in TikaParser.java, for example:


public ParseResult getParse(Content content) {
		String mimeType = content.getContentType();

+		// Hard-coded here for illustration; the proposal is to read these from a property.
+		String[] validTypes = new String[] {
+			"application/pdf", "application/x-tika-msoffice", "application/x-tika-ooxml",
+			"application/vnd.oasis.opendocument.text", "text/plain", "application/rtf",
+			"application/rss+xml", "application/x-bzip2", "application/x-gzip",
+			"application/x-javascript", "application/javascript", "text/javascript",
+			"application/x-shockwave-flash", "application/zip", "text/xml", "application/xml"};
+		boolean valid = false;
+		for (int k = 0; k < validTypes.length; k++) {
+			// equalsIgnoreCase returns false for a null mimeType instead of throwing
+			if (validTypes[k].equalsIgnoreCase(mimeType)) {
+				valid = true;
+				break;
+			}
+		}
+		if (!valid)
+			return new ParseStatus(ParseStatus.NOTPARSED, "Can't parse for unwanted filetype " + mimeType).getEmptyParseResult(content.getUrl(), getConf());
	
		URL base;
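
The suggestion to move the valid types into a property could be sketched roughly as follows. This is only an illustration in plain Java, not code from Nutch or Tika: the class name MimeTypeAllowList and the idea of a comma-separated property value are assumptions, and a real patch would read the value from the Nutch configuration instead of taking a string.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch): holds the MIME types a parser may handle.
final class MimeTypeAllowList {
    private final Set<String> allowed;

    // commaSeparated is assumed to come from a configuration property,
    // e.g. "application/pdf,text/plain".
    public MimeTypeAllowList(String commaSeparated) {
        Set<String> types = new HashSet<>();
        if (commaSeparated != null) {
            for (String entry : commaSeparated.split(",")) {
                String normalized = normalize(entry);
                if (!normalized.isEmpty()) {
                    types.add(normalized);
                }
            }
        }
        this.allowed = Collections.unmodifiableSet(types);
    }

    // Lower-cases the type and drops parameters such as "; charset=UTF-8".
    static String normalize(String mimeType) {
        if (mimeType == null) {
            return "";
        }
        int semicolon = mimeType.indexOf(';');
        String base = semicolon >= 0 ? mimeType.substring(0, semicolon) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // A null or unknown type is treated as not allowed.
    public boolean isAllowed(String mimeType) {
        return allowed.contains(normalize(mimeType));
    }
}
```

With such a helper, the check in getParse would reduce to a single isAllowed(mimeType) call, and the list could be changed without recompiling the plugin.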

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

Posted by "behnam nikbakht (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211316#comment-13211316 ] 

behnam nikbakht commented on NUTCH-1281:
----------------------------------------

The problem is that the actual MIME type cannot be determined, and so cannot be filtered on, until the fetch or parse starts. There are also too many file types to filter them all out, and the Tika parser may have bugs with some of them.
So we can filter in TikaParser against a list of valid file types.
                

[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

Posted by "Lewis John McGibbney (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211314#comment-13211314 ] 

Lewis John McGibbney commented on NUTCH-1281:
---------------------------------------------

Hi behnam, there is a similar issue open, NUTCH-965, and a patch has been submitted for Nutchgora. Could you check it out and comment on how the two issues relate?

Also, would it be possible for you to attach your code changes as a patch against trunk, which I guess is what you are using? Thank you.
                

[jira] [Commented] (NUTCH-1281) tika parser not work properly with unwanted file types that passed from filters in nutch

Posted by "Julien Nioche (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212502#comment-13212502 ] 

Julien Nioche commented on NUTCH-1281:
--------------------------------------

Behnam,

I suppose that you are seeing this issue when using the Crawl class but not when using a script. The timeout mechanism prevents the parser from getting locked on files which have been truncated or which put the underlying parser library in a spin. When using the Crawl class, however, these runaway threads are not cleared; they accumulate and take up all the remaining memory. The Crawl class is planned to be replaced by a shell script, which will remove this issue and allow people to modify the process easily (and make the pipeline easier to understand).

Or are you seeing this when using the Parse command in a script? Again, the timeout mechanism should prevent the parser from crashing.
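
For illustration, the general shape of such a timeout guard can be sketched in plain Java as below. This is not Nutch's actual implementation; the class and method names (TimedParse, runWithTimeout) are hypothetical. It also shows why runaway threads can pile up in a long-lived JVM: cancelling the task only interrupts the worker thread, so a parser stuck in a tight loop that ignores interrupts keeps running.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper (not Nutch code): runs a parse task with a time limit.
final class TimedParse {

    // Returns the task's result, or fallback if it fails or exceeds timeoutMillis.
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis, T fallback) {
        // Daemon thread, so a stuck worker does not keep the JVM alive on exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "parse-worker");
            worker.setDaemon(true);
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts the worker; code that ignores
            // interrupts keeps running and keeps holding its memory.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The task itself threw; treat it as a failed parse.
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

When each step runs as a separate process started from a script, any threads left behind this way die with the process, which matches the difference described above between the script and the Crawl class.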

Now if the issue is to prevent the Tika plugin from processing certain types, a better approach would be to filter the docs prior to parsing based on their mime-types, which we can now access from the crawldb metadata. The trouble is that the URLFilters consider only the URL string and not any metadata. We could change the API of URLFilters? What other metadata would we take into account for filtering?

Another approach would be to filter based on the content type in ParseUtil, so that it applies not only to Tika but to any other parser, and to have a blacklist of mimetypes that would not be parsed.
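
A minimal sketch of what such a blacklist check might look like, in plain Java. Everything here is an assumption made for illustration: the class name MimeTypeDenyList, the isDenied method, and the support for "type/*" wildcard entries are not existing Nutch APIs. The idea is that the parse step would consult the list once, before choosing any parser.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not part of Nutch): MIME types that should never be parsed.
final class MimeTypeDenyList {
    private final Set<String> exact = new HashSet<>();
    // Holds "video/" for an entry written as "video/*".
    private final Set<String> prefixes = new HashSet<>();

    public MimeTypeDenyList(String... entries) {
        for (String entry : entries) {
            String e = entry.trim().toLowerCase(Locale.ROOT);
            if (e.endsWith("/*")) {
                prefixes.add(e.substring(0, e.length() - 1));
            } else if (!e.isEmpty()) {
                exact.add(e);
            }
        }
    }

    // Returns true if the content type matches an exact entry or a wildcard entry.
    // An unknown (null) content type is not denied, leaving the decision to the parser.
    public boolean isDenied(String contentType) {
        if (contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim().toLowerCase(Locale.ROOT);
        if (exact.contains(base)) {
            return true;
        }
        int slash = base.indexOf('/');
        return slash >= 0 && prefixes.contains(base.substring(0, slash + 1));
    }
}
```

Compared with a list of valid types inside one parser plugin, a deny list at this level only needs to name the problematic types (for example "video/*") and covers every parser at once.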

Any thoughts?




                