Posted to user@nutch.apache.org by Matthias Paul <ma...@gmail.com> on 2012/05/18 14:56:30 UTC

Exclude certain mime-types

How can I exclude certain mime-types from crawling, for example Word documents?
If I have parse-tika in plugin.includes it will parse them. Do I have
to change parse-plugins.xml?

I can't exclude them in regex-urlfilter as the .doc extension is not
present in the urls.

Thanks
Matthias

RE: Exclude certain mime-types

Posted by Markus Jelsma <ma...@openindex.io>.

 
 
-----Original message-----
> From:Matthias Paul <ma...@gmail.com>
> Sent: Fri 18-May-2012 14:57
> To: user@nutch.apache.org
> Subject: Exclude certain mime-types
> 
> How can I exclude certain mime-types from crawling, for example Word documents?
> If I have parse-tika in plugin.includes it will parse them. Do I have
> to change parse-plugins.xml?

Yes. You have to remove the wildcard MIME type (*) that is mapped to Tika in parse-plugins.xml and manually map each MIME type you want parsed to the appropriate parser, which is usually Tika.

Keep in mind that in this file you have to map both text/html and application/xhtml+xml if you need to parse HTML.
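
As a rough sketch of what that could look like in conf/parse-plugins.xml: the wildcard entry is commented out and only the wanted types are mapped. The application/pdf entry is just an illustrative assumption about what else you might want to keep, and the alias line is written from memory of the stock file, so compare both against the parse-plugins.xml shipped with your Nutch version. Nothing here maps application/msword, which is the point. It is worth checking on a test crawl that Word documents are really skipped.

```xml
<parse-plugins>

  <!-- Wildcard mapping disabled so Tika no longer receives every type.
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  -->

  <!-- Both HTML types have to be mapped explicitly. -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>

  <mimeType name="application/xhtml+xml">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- Illustrative only: any other type you still want parsed. -->
  <mimeType name="application/pdf">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- No entry for application/msword. -->

  <aliases>
    <!-- Assumed to match the stock file; verify in your copy. -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>

</parse-plugins>
```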

> 
> I can't exclude them in regex-urlfilter as the .doc extension is not
> present in the urls.
> 
> Thanks
> Matthias
>