Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/01/04 16:54:25 UTC
Re: nutch parse Tika problem
On Thursday 22 December 2011 02:06:14 Xiao Li wrote:
> Hi
>
> I am debugging Nutch in Eclipse on Ubuntu. I can run the crawler
> program smoothly. However, when it tries to parse a PDF file, I just get
> the error message "failed(2,0): Can't retrieve Tika parser for mime-type
> application/pdf".
Check the log; there should be more detail there. However, by default Tika is
mapped to * in parse-plugins.xml and in its own plugin.xml, so it should work
without a problem. Or are you using an ancient version?
>
> I tried debugging further into Tika and found this in the TikaConfig class:
>
> public TikaConfig() throws MimeTypeException, IOException {
>     ParseContext context = new ParseContext();
>     Iterator<Parser> iterator = ServiceRegistry.lookupProviders(
>             Parser.class, this.getClass().getClassLoader());
>     while (iterator.hasNext()) {
>         Parser parser = iterator.next();
>         for (MediaType type : parser.getSupportedTypes(context)) {
>             parsers.put(type.toString(), parser);
>         }
>     }
>     mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
> }
>
> The while loop never executes its body, so it does not put an
> <application/pdf, class> entry in the map. That is why it cannot retrieve
> a parser for MIME type application/pdf. I strongly suspect that no parser
> class is registered in ServiceRegistry. However, even after I set the
> property in nutch-site.xml and parse-plugins.xml, the problem remains.
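One way to test that suspicion is to check whether the class loader can see
any provider-configuration file for Tika parsers at all, since
ServiceRegistry.lookupProviders discovers providers through files under
META-INF/services. The following is only a minimal sketch using the plain JDK:
the class name ServiceFileCheck and the helper findServiceFiles are
hypothetical, and it assumes the loader you pass in is the same one that loads
TikaConfig in your Eclipse run (inside Nutch that may be a plugin class loader
rather than the context class loader used here).

```java
import java.io.IOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper; not part of Nutch or Tika.
public class ServiceFileCheck {

    // Provider-configuration file that lists Tika Parser implementations.
    static final String PARSER_SERVICE_FILE =
            "META-INF/services/org.apache.tika.parser.Parser";

    // Returns every copy of the named resource visible to the given loader.
    static List<URL> findServiceFiles(ClassLoader loader, String resourceName)
            throws IOException {
        return Collections.list(loader.getResources(resourceName));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the context class loader matches the one TikaConfig uses.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        List<URL> found = findServiceFiles(loader, PARSER_SERVICE_FILE);
        if (found.isEmpty()) {
            System.out.println("No " + PARSER_SERVICE_FILE
                    + " is visible to this class loader");
        }
        for (URL url : found) {
            // Each URL shows which jar or directory supplies parser registrations.
            System.out.println(url);
        }
    }
}
```

If nothing is listed, the jar that carries the parser registrations is
probably missing from the Eclipse run configuration's classpath, which would
match the empty iterator you are seeing.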
>
> Can anybody help me?
>
> cheers
> Xiao
--
Markus Jelsma - CTO - Openindex