Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/01/04 16:54:25 UTC

Re: nutch parse Tika problem


On Thursday 22 December 2011 02:06:14 Xiao Li wrote:
> Hi
> 
> I am debugging Nutch in Eclipse on Ubuntu. I can run the crawler
> program smoothly. However, when it tries to parse a PDF file, I just get
> the error message "failed(2,0): Can't retrieve Tika parser for mime-type
> application/pdf".

Check the log, there should be more detail there. By default, however, Tika
is mapped to * in parse-plugins.xml and in its own plugin.xml, so it should
work without a problem. Or are you using an ancient version?

> 
> I debugged into Tika and found that in the TikaConfig class,
> 
> public TikaConfig() throws MimeTypeException, IOException {
>     ParseContext context = new ParseContext();
>     Iterator<Parser> iterator = ServiceRegistry.lookupProviders(
>         Parser.class, this.getClass().getClassLoader());
>     while (iterator.hasNext()) {
>         Parser parser = iterator.next();
>         for (MediaType type : parser.getSupportedTypes(context)) {
>             parsers.put(type.toString(), parser);
>         }
>     }
>     mimeTypes = MimeTypesFactory.create("tika-mimetypes.xml");
> }
> 
> the while loop does not do anything: it never puts an <application/pdf,
> class> entry in its Map. That's why it cannot retrieve a parser for the
> mime type application/pdf. I strongly suspect that there is no parser
> class registered in ServiceRegistry. However, even after I set the
> property in nutch-site.xml and parse-plugins.xml, the problem remains.
> 
> Can anybody help me?
> 
> cheers
> Xiao
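
One way to test the suspicion that no parser is registered is to ask the
class loader directly which providers it can see. The sketch below is only an
illustration under assumptions: it uses java.util.ServiceLoader rather than
the ServiceRegistry call from the snippet above (both read provider files
under META-INF/services, as far as I know), and ProviderLookupCheck,
DemoParser and visibleProviders are hypothetical names, not part of Nutch or
Tika. In a real setup you would pass org.apache.tika.parser.Parser.class
instead of the stand-in interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ProviderLookupCheck {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface DemoParser {
        String supportedType();
    }

    /**
     * Returns the class names of every provider of the given service
     * that the given class loader can discover via META-INF/services.
     */
    public static <T> List<String> visibleProviders(Class<T> service, ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (T provider : ServiceLoader.load(service, loader)) {
            names.add(provider.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        ClassLoader loader = ProviderLookupCheck.class.getClassLoader();
        List<String> found = visibleProviders(DemoParser.class, loader);
        if (found.isEmpty()) {
            // An empty result means no provider file for this service
            // is visible to this class loader.
            System.out.println("No providers registered for " + DemoParser.class.getName());
        } else {
            found.forEach(System.out::println);
        }
    }
}
```

If the list comes back empty for the Tika Parser interface when run with the
same class loader that TikaConfig uses, the jar carrying the provider file is
probably not visible to that loader, which would explain why the while loop
never runs.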

-- 
Markus Jelsma - CTO - Openindex