Posted to user@nutch.apache.org by AJ Chen <aj...@web2express.org> on 2010/08/05 22:33:58 UTC

tika error

I'm running Nutch 1.1 in distributed mode. The slave and master have the same
parsing-related configuration:
nutch-site.xml:  parse-(html|text|js|zip|tika)
parse-plugins.xml: enable the Nutch parsers for html, text, js, zip

These settings should use the Nutch parsers for html, text, js, and zip, and
Tika for PDF and everything else.

During the ParseSegment step, the log messages on the master look normal, but
the log file on the slave machine is full of the following errors, for every
mime type:
2010-08-05 09:22:31,916 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type text/html
2010-08-05 09:22:32,048 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
...and the same for all other mime types
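
To confirm that the error really covers every mime type rather than a subset, the log can be tallied per type. This is only a minimal sketch in Python: the function name is made up for illustration, and the pattern is taken from the two log lines above, so it assumes the message format does not vary.

```python
import re
from collections import Counter

# Pattern copied from the log lines quoted above. The \s+ before "for"
# tolerates a message that a mail client has wrapped onto a second line.
PATTERN = re.compile(
    r"ERROR tika\.TikaParser - Can't retrieve Tika parser\s+for mime-type (\S+)"
)


def tally_tika_errors(log_text):
    """Count "Can't retrieve Tika parser" errors per mime type.

    Hypothetical helper, not part of Nutch or Tika.
    """
    return Counter(PATTERN.findall(log_text))
```

Passing the contents of the slave's log file to tally_tika_errors() gives a Counter keyed by mime type. Seeing text/html in it would show Tika being asked to handle a type that the configuration assigns to the Nutch parser.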

Any idea why the slave machine reports this Tika error for all mime types?

thanks,
-aj

Re: tika error

Posted by AJ Chen <aj...@web2express.org>.
Tika parsing works on the master node. It is the slave node that throws the
"Can't retrieve Tika parser" error on every document, even though the slave
node has all the needed configuration (the same as the master node). Something
on the slave node must be different.
For example, my config uses the Nutch HTML parser (not Tika) for HTML
documents, but parsing on the slave node throws "Can't retrieve Tika parser
for mime-type text/html". It should not be trying to use Tika for HTML at all.

Has anybody seen this problem before? I downloaded the Nutch source and
recompiled nutch-1.1.job.
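
One way to rule out a packaging problem in the rebuilt job file is to list what it actually contains. The following is a minimal sketch, assuming the .job file is an ordinary zip/jar archive. The function name and the default search strings are illustrative choices, not Nutch APIs, and nothing here says where those entries are supposed to live inside the archive.

```python
import zipfile


def find_entries(archive_path, needles=("parse-tika", "tika-mimetypes.xml")):
    """Return, for each search string, the archive entry names containing it.

    Hypothetical helper; assumes archive_path is a zip-compatible file.
    """
    with zipfile.ZipFile(archive_path) as archive:
        names = archive.namelist()
    return {needle: [name for name in names if needle in name] for needle in needles}
```

Calling find_entries("nutch-1.1.job") returns a dict of matching entry names. An empty list is only a hint, since a file can also sit inside a nested jar that this listing does not open.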
thanks,
-aj

On Fri, Aug 6, 2010 at 10:44 AM, Scott Gonyea <me...@sgonyea.com> wrote:

> Are your ERROR messages confined to the ParseSegment stuff?  You have a
> tika-mimetypes.xml, right?  And your nutch-default has it included /
> nutch-site doesn't override it?
>
> Did you download the nutch 1.1 build or did you clone it off of svn?
>
> sg
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA

Re: tika error

Posted by Scott Gonyea <me...@sgonyea.com>.
Are your ERROR messages confined to the ParseSegment stuff?  You have a
tika-mimetypes.xml, right?  And your nutch-default has it included /
nutch-site doesn't override it?

Did you download the nutch 1.1 build or did you clone it off of svn?

sg
