Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/02/08 13:07:11 UTC

tika-core, tika-parser?

Hi,

In Nutch we have a copy of tika-core. But with just that lib we also have 
access to the Tika.parser API from the other module. How does this all work? 
I ask because I have had confusing results in the past (and now).

Right now we've added a class to org.apache.tika.parser.html, but we get a 
ClassNotFound with a newly compiled Tika. Our code compiles when we add 
tika-parsers to the classpath, but when we run it we get an obscure exception:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at sun.misc.Service$LazyIterator.next(Service.java:271)
        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:255)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)

When we previously patched Tika in the core module all went perfectly well, but 
patching the parser module and getting it all compiled into tika-core.jar seems 
tricky. Any advice? What am I missing? How do the parser libs end up in the 
core jar?

Thanks

Re: tika-core, tika-parser?

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 8 Feb 2012, Markus Jelsma wrote:
> In Nutch we have a copy of tika-core. But with just that lib we also 
> have access to the Tika.parser API from the other module. How does this 
> all work? I ask because I have had confusing results in the past (and now).

Tika Core contains the core of Tika, which includes the definition of how 
parsers work, but not any actual parsers.

All the parsers themselves are in the Tika Parsers module. Most of the 
parsers depend on third-party libraries, so it's normally recommended to 
use either Maven or the OSGi bundle to have these pulled in for you.


> Right now we've added a class to org.apache.tika.parser.html but we get a
> ClassNotFound with a newly compiled Tika. Our code compiles when we add
> tika-parsers to the classpath, but when we run we get some obscure exception:
>
> Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.tika.parser.dwg.DWGParser
>        at java.lang.Class.forName0(Native Method)
>        at java.lang.Class.forName(Class.java:247)
>        at sun.misc.Service$LazyIterator.next(Service.java:271)
>        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
>        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)

You've got a Tika parsers service file that says the DWG parser is 
present, but you haven't included it. You should either include all the 
Tika parsers, or not include the default 
META-INF/services/org.apache.tika.parser.Parser file that lists them.
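
One way to see which listed parsers are actually usable is to read the 
service files yourself and try to load each class. The following is only a 
rough sketch using plain JDK calls, not anything from Tika or Nutch: the 
class name ServiceFileCheck and its helper methods are made up for 
illustration, and it assumes the parsers are listed in 
META-INF/services/org.apache.tika.parser.Parser on the classpath you run 
it with.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: reports listed service providers that fail to load. */
final class ServiceFileCheck {

    /** Strips a trailing '#' comment and whitespace; returns "" for blank lines. */
    static String parseLine(String line) {
        int hash = line.indexOf('#');
        if (hash >= 0) {
            line = line.substring(0, hash);
        }
        return line.trim();
    }

    /** Collects class names from every copy of the resource visible to the loader. */
    static List<String> listedProviders(ClassLoader loader, String resource)
            throws IOException {
        List<String> names = new ArrayList<>();
        for (URL url : Collections.list(loader.getResources(resource))) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String name = parseLine(line);
                    if (!name.isEmpty()) {
                        names.add(name);
                    }
                }
            }
        }
        return names;
    }

    /** Returns null if the class loads and initializes, else the failure as text. */
    static String loadProblem(ClassLoader loader, String className) {
        try {
            Class.forName(className, true, loader);
            return null;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError and ExceptionInInitializerError
            return e.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        ClassLoader loader = ServiceFileCheck.class.getClassLoader();
        // Assumed location of the parser list; adjust if your setup differs.
        String resource = "META-INF/services/org.apache.tika.parser.Parser";
        for (String name : listedProviders(loader, resource)) {
            String problem = loadProblem(loader, name);
            System.out.println(name + " : " + (problem == null ? "OK" : problem));
        }
    }
}
```

Run it with the same classpath as the failing job. Any entry that does not 
print OK is listed in a service file but cannot be loaded or initialized 
there, which usually points to a missing parser jar or a missing 
third-party dependency of that parser.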

Nick