Posted to user@nutch.apache.org by Joseph Naegele <jn...@grierforensics.com> on 2016/04/26 15:41:27 UTC

Can't disable fallback parser

Hi folks,

I'm using Nutch 1.11. What I'd like to do is use parse-tika for HTML and
maybe a select few other content types, but nothing else. This doesn't
appear to be possible without making changes in places beyond
parse-plugins.xml.
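
For illustration, a minimal sketch of the kind of parse-plugins.xml I mean.
The layout is written from memory of the stock file, so treat the details
(in particular the alias entry) as assumptions rather than a copy of my
actual config:

```xml
<parse-plugins>
  <!-- Only HTML is mapped to parse-tika. -->
  <mimeType name="text/html">
    <plugin id="parse-tika" />
  </mimeType>

  <!-- The catch-all mapping is deliberately left out:
  <mimeType name="*">
    <plugin id="parse-tika" />
  </mimeType>
  -->

  <aliases>
    <!-- Assumed extension id for the Tika parser. -->
    <alias name="parse-tika"
           extension-id="org.apache.nutch.parse.tika.TikaParser" />
  </aliases>
</parse-plugins>
```

With only this in place I would expect other content types to be left
unparsed, which is not what happens.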

Implementation details: In ParserFactory, if no parser is mapped to the
given contentType and the parse-tika plugin *is* enabled, parse-tika is
automatically used as a fallback, since its plugin.xml file declares that
it works with all contentTypes.

This seems a bit underhanded, since in parse-plugins.xml I'm explicitly
disabling the glob -> parse-tika mapping. I haven't tested it, but I imagine
I can work around this by changing parse-tika's plugin.xml to map to a
subset of contentTypes rather than '*'.
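
An untested sketch of that workaround, assuming parse-tika's plugin.xml
declares its supported types through a contentType parameter, and that the
value can be a '|'-separated pattern as in other parser plugins. The
element layout is from memory and the list of types is only an example:

```xml
<extension id="org.apache.nutch.parse.tika"
           name="TikaParser"
           point="org.apache.nutch.parse.Parser">
  <implementation id="org.apache.nutch.parse.tika.TikaParser"
                  class="org.apache.nutch.parse.tika.TikaParser">
    <!-- Originally value="*", which makes parse-tika eligible as the
         fallback for every content type. Restricted here to an
         explicit, illustrative subset. -->
    <parameter name="contentType"
               value="text/html|application/xhtml+xml" />
  </implementation>
</extension>
```

The downside is that this means editing a file shipped with the plugin
rather than my own configuration.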

Is this a bug or just something that should be documented?

Thanks,

Joe