Posted to user@nutch.apache.org by Matthias Paul <ma...@gmail.com> on 2010/09/27 18:51:54 UTC

parse-tika config

Hi,

I'm using parse-tika, but how can I choose which MIME types it parses and
which it skips? E.g. what if I'm only interested in PDFs and not DOC files?
Do I have to edit the tika-mimetypes.xml file, or is there some other way
to configure this?

Second question: if I don't use parse-tika, I get a lot of parse exceptions
for JPGs etc., as Nutch doesn't know what to do with this content.
Is it normal to see these exceptions? I find it a bit strange to see all
these errors...

Thanks
Matthias