Posted to dev@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2012/02/08 13:50:18 UTC

tika-core, tika-parser

Hi,

Can anyone shed light on this? We don't have any parsers in our libs dir, and 
we don't have the tika-parsers jar, only the tika-core jar. Where are the parsers 
and how does this all work? 

I've posted a question (same subject) on the Tika list and Nick tells me there 
must be parsers somewhere. Well, I have no idea how we do it in Nutch, do you?

Thanks

Re: tika-core, tika-parser

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, it's listed there indeed! But where are the parser impls then? I'll check 
this out. I must be going crazy or something!

On Wednesday 08 February 2012 13:58:46 Lewis John Mcgibbney wrote:
> Hi Markus,
> 
> For starters
> 
> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
> 
> Can we pick our way through this?
> 
> Thanks
> 
> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > Hi,
> > 
> > Can anyone shed light on this? We don't have any parsers in our libs dir
> > and we don't have tika-parsers jar, only the tika-core jar. Where are the
> > parsers and how does this all work?
> > 
> > I've posted a question (same subject) on the Tika list and Nick tells me
> > there must be parsers somewhere. Well, i have no idea how we do it in
> > Nutch, do you?
> > 
> > Thanks

-- 
Markus Jelsma - CTO - Openindex

Re: tika-core, tika-parser

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 08 February 2012 18:27:32 Ken Krugler wrote:
> On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:
> > On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
> >> sorry don't understand what your issue is. We have a dependency on
> >> tika-parsers and the actual parser implementations (listed in tika
> >> parsers' POM) are pulled transitively just like any other dependency
> >> managed by Ivy. They end up being copied in 
> >> runtime/local/plugins/parse-tika/ or put in the job in runtime/deploy/
> > 
> > My problem is that i am working on some code for Tika-parsers
> > 1.1-SNAPSHOT that i need to use in Nutch. However, when i build
> > tika-parsers and put it in Nutch' lib directory i still seem to be
> > missing dependencies. Then trouble begins:
> 
> I don't know anything about how Nutch handles jars in its lib directory,
> but this sounds like you have a "raw" jar (tika-parsers) without its
> pom.xml.
> 
> So then Ivy (or Maven) doesn't know about the transitive dependencies on
> other jars, which are needed to implement the actual parsing support.

You're right, that's exactly what happened, although I wasn't fully aware 
of it. Thanks.

> 
> -- Ken
> 
> > Exception in thread "main" java.lang.NoClassDefFoundError: Could not
> > initialize class org.apache.tika.parser.dwg.DWGParser
> >        at java.lang.Class.forName0(Native Method)
> >        at java.lang.Class.forName(Class.java:247)
> >        at sun.misc.Service$LazyIterator.next(Service.java:271)
> >        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
> >        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
> >        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
> >        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
> >        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
> >        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
> >        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
> >        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
> > 
> > Nick told me to remove DWG from the org.apache.tika.parsers.Parsers
> > config file, which i did. But then other dependency issues come and go.
> > The more parsers i remove from the config file the better it goes, but
> > then Tika won't build anymore because of failing tests.
> > 
> > I asked this on the Nutch list because i wasn't sure anymore how Nutch
> > deals with these its own deps, which you explained well.
> > 
> > I'll give up for now :)
> > 
> >> On 8 February 2012 13:03, Markus Jelsma <ma...@openindex.io> wrote:
> >>> Yes, it looks like it! It should also be upgraded to Tika 1.0. But
> >>> that's something else.
> >>> 
> >>> dependencies, dependencies, dependencies.... :(
> >>> 
> >>> On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
> >>>> The dependencies for the plugins are defined locally as shown in the
> >>>> URL below, where you can see the ref to tika-parsers for parse-tika.
> >>>> Is that more clear for you Markus?
> >>>> 
> >>>> On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:
> >>>>> Hi Markus,
> >>>>> 
> >>>>> For starters
> >>>>> 
> >>>>> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
> >>>>> 
> >>>>> Can we pick our way through this?
> >>>>> 
> >>>>> Thanks
> >>>>> 
> >>>>> 
> >>>>> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
> >>>>> <markus.jelsma@openindex.io> wrote:
> >>>>>> Hi,
> >>>>>> 
> >>>>>> Can anyone shed light on this? We don't have any parsers in our libs
> >>>>>> dir and we don't have tika-parsers jar, only the tika-core jar. Where
> >>>>>> are the parsers and how does this all work?
> >>>>>> 
> >>>>>> I've posted a question (same subject) on the Tika list and Nick tells
> >>>>>> me there must be parsers somewhere. Well, i have no idea how we do it
> >>>>>> in Nutch, do you?
> >>>>>> 
> >>>>>> Thanks
> >>>>> 
> >>>>> --
> >>>>> *Lewis*
> >>> 
> >>> --
> >>> Markus Jelsma - CTO - Openindex
> 
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

-- 
Markus Jelsma - CTO - Openindex

Re: tika-core, tika-parser

Posted by Ken Krugler <kk...@transpac.com>.
On Feb 8, 2012, at 5:28am, Markus Jelsma wrote:

> 
> 
> On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
>> sorry don't understand what your issue is. We have a dependency on
>> tika-parsers and the actual parser implementations (listed in tika parsers'
>> POM) are pulled transitively just like any other dependency managed by Ivy.
>> They end up being copied in  runtime/local/plugins/parse-tika/ or put in
>> the job in runtime/deploy/
> 
> My problem is that i am working on some code for Tika-parsers 1.1-SNAPSHOT 
> that i need to use in Nutch. However, when i build tika-parsers and put it in 
> Nutch' lib directory i still seem to be missing dependencies. Then trouble 
> begins:

I don't know anything about how Nutch handles jars in its lib directory, but this sounds like you have a "raw" jar (tika-parsers) without its pom.xml.

So then Ivy (or Maven) doesn't know about the transitive dependencies on other jars, which are needed to implement the actual parsing support.
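
To make the point concrete, here is a minimal Java sketch of transitive resolution. It is a toy model, not how Ivy or Maven are implemented: the `TransitiveDeps` class, the `resolve` method, and every artifact name other than tika-parsers and tika-core are made up for illustration. A resolver can only follow dependencies that a descriptor declares, so an artifact with no descriptor resolves to itself alone.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TransitiveDeps {
    // Walks the declared dependencies breadth-first and returns every artifact reached.
    // "descriptors" plays the role of the pom.xml / ivy.xml metadata (hypothetical model).
    static Set<String> resolve(String root, Map<String, List<String>> descriptors) {
        Set<String> resolved = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String artifact = queue.remove();
            // Only expand an artifact the first time it is seen.
            if (resolved.add(artifact)) {
                queue.addAll(descriptors.getOrDefault(artifact, List.of()));
            }
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Illustrative metadata; "format-lib-a" etc. are placeholder names.
        Map<String, List<String>> withDescriptor = new HashMap<>();
        withDescriptor.put("tika-parsers", List.of("tika-core", "format-lib-a", "format-lib-b"));
        withDescriptor.put("format-lib-a", List.of("support-lib"));

        // With metadata, the whole closure is found.
        System.out.println(resolve("tika-parsers", withDescriptor));
        // A raw jar with no metadata resolves to itself only.
        System.out.println(resolve("tika-parsers", Map.of()));
    }
}
```

With the descriptor map, `resolve` returns all five artifacts; with an empty map it returns only tika-parsers, which matches the situation of a raw jar dropped into a lib directory.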

-- Ken

> 
> Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
> initialize class org.apache.tika.parser.dwg.DWGParser
>        at java.lang.Class.forName0(Native Method)
>        at java.lang.Class.forName(Class.java:247)
>        at sun.misc.Service$LazyIterator.next(Service.java:271)
>        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
>        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
>        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
>        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
>        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
>        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
>        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
> 
> Nick told me to remove DWG from the org.apache.tika.parsers.Parsers config 
> file, which i did. But then other dependency issues come and go. The more 
> parsers i remove from the config file the better it goes, but then Tika won't 
> build anymore because of failing tests.
> 
> I asked this on the Nutch list because i wasn't sure anymore how Nutch deals 
> with these its own deps, which you explained well.
> 
> I'll give up for now :)
> 
> 
> 
>> 
>> On 8 February 2012 13:03, Markus Jelsma <ma...@openindex.io> wrote:
>>> Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's
>>> something else.
>>> 
>>> dependencies, dependencies, dependencies.... :(
>>> 
>>> On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
>>>> The dependencies for the plugins are defined locally as shown in the
>>>> URL below, where you can see the ref to tika-parsers for parse-tika.
>>>> Is that more clear for you Markus?
>>>> 
>>>> On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:
>>>>> Hi Markus,
>>>>> 
>>>>> For starters
>>>>> 
>>>>> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
>>>>> 
>>>>> Can we pick our way through this?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> 
>>>>> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
>>>>> <markus.jelsma@openindex.io> wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Can anyone shed light on this? We don't have any parsers in our libs
>>>>>> dir and we don't have tika-parsers jar, only the tika-core jar. Where
>>>>>> are the parsers and how does this all work?
>>>>>> 
>>>>>> I've posted a question (same subject) on the Tika list and Nick tells
>>>>>> me there must be parsers somewhere. Well, i have no idea how we do it
>>>>>> in Nutch, do you?
>>>>>> 
>>>>>> Thanks
>>>>> 
>>>>> --
>>>>> *Lewis*
>>> 
>>> --
>>> Markus Jelsma - CTO - Openindex
> 
> -- 
> Markus Jelsma - CTO - Openindex

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: tika-core, tika-parser

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 08 February 2012 14:22:36 Julien Nioche wrote:
> sorry don't understand what your issue is. We have a dependency on
> tika-parsers and the actual parser implementations (listed in tika parsers'
> POM) are pulled transitively just like any other dependency managed by Ivy.
> They end up being copied in  runtime/local/plugins/parse-tika/ or put in
> the job in runtime/deploy/

My problem is that I am working on some code for tika-parsers 1.1-SNAPSHOT 
that I need to use in Nutch. However, when I build tika-parsers and put it in 
Nutch's lib directory, I still seem to be missing dependencies. Then the trouble 
begins:

Exception in thread "main" java.lang.NoClassDefFoundError: Could not 
initialize class org.apache.tika.parser.dwg.DWGParser
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at sun.misc.Service$LazyIterator.next(Service.java:271)
        at org.apache.nutch.parse.tika.TikaConfig.<init>(TikaConfig.java:149)
        at org.apache.nutch.parse.tika.TikaConfig.getDefaultConfig(TikaConfig.java:211)
        at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:254)
        at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:162)
        at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:132)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:71)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:101)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:138)
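
A note on reading this trace: in Java, "Could not initialize class X" usually means that the static initializer of X already failed on an earlier attempt, so the root cause (often a class from a missing dependency jar) is reported by an earlier ExceptionInInitializerError rather than by this error. The sketch below reproduces that behaviour with a hypothetical `Fragile` class whose static initializer always throws; none of the names come from Tika or Nutch.

```java
class StaticInitDemo {
    // Hypothetical class whose static initialization always fails,
    // standing in for a parser that needs a class from a missing jar.
    static class Fragile {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated missing dependency");
        }
    }

    // Touches Fragile and reports which error type the JVM raises.
    static String touch() {
        try {
            return "ok: " + Fragile.VALUE;
        } catch (Throwable t) {
            return t.getClass().getSimpleName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // First access runs the static initializer and fails with
        // ExceptionInInitializerError, which carries the real cause.
        System.out.println(touch());
        // Later accesses fail with NoClassDefFoundError
        // ("Could not initialize class ..."), without re-running the initializer.
        System.out.println(touch());
    }
}
```

So when only the NoClassDefFoundError is visible, it can be worth looking further up the log for the first failure of the same class.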

Nick told me to remove DWG from the META-INF/services/org.apache.tika.parser.Parser 
config file, which I did. But then other dependency issues come and go. The more 
parsers I remove from the config file the better it goes, but then Tika won't 
build anymore because of failing tests.
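
For anyone debugging a similar setup, the following sketch shows one way to see which registered providers fail to load, instead of removing them from the config file one by one. It assumes the parsers are discovered through provider files under META-INF/services, as the sun.misc.Service frame in the trace suggests, and it uses the public java.util.ServiceLoader API as a stand-in for that internal class. `ProviderProbe` and `DemoParser` are hypothetical names; to probe Tika you would pass the real parser interface with the jars in question on the classpath.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.ServiceConfigurationError;
import java.util.ServiceLoader;

class ProviderProbe {
    // Hypothetical stand-in for a parser interface; not Tika's real API.
    public interface DemoParser {}

    // Returns one line per registered provider: its class name if it loaded,
    // or the error if loading or instantiating it failed.
    static <T> List<String> probe(Class<T> service) {
        List<String> report = new ArrayList<>();
        Iterator<T> providers = ServiceLoader.load(service).iterator();
        // The guard bounds the loop in case the iterator keeps failing.
        for (int guard = 0; guard < 1000; guard++) {
            try {
                if (!providers.hasNext()) {
                    break;
                }
                report.add("OK     " + providers.next().getClass().getName());
            } catch (ServiceConfigurationError | LinkageError e) {
                // LinkageError covers NoClassDefFoundError from missing dependencies.
                report.add("FAILED " + e);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Prints nothing here, because no provider is registered for DemoParser.
        probe(DemoParser.class).forEach(System.out::println);
    }
}
```

Each FAILED line names a provider whose dependencies are probably not on the classpath, which points at the jars that still need to be added.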

I asked this on the Nutch list because I wasn't sure anymore how Nutch deals 
with its own deps, which you explained well.

I'll give up for now :)



> 
> On 8 February 2012 13:03, Markus Jelsma <ma...@openindex.io> wrote:
> > Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's
> > something else.
> > 
> > dependencies, dependencies, dependencies.... :(
> > 
> > On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
> > > The dependencies for the plugins are defined locally as shown in the
> > > URL below, where you can see the ref to tika-parsers for parse-tika.
> > > Is that more clear for you Markus?
> > > 
> > > On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:
> > > > Hi Markus,
> > > > 
> > > > For starters
> > > > 
> > > > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
> > > > 
> > > > Can we pick our way through this?
> > > > 
> > > > Thanks
> > > > 
> > > > 
> > > > On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
> > > > <markus.jelsma@openindex.io> wrote:
> > > > >> Hi,
> > > > >> 
> > > > >> Can anyone shed light on this? We don't have any parsers in our libs
> > > > >> dir and we don't have tika-parsers jar, only the tika-core jar. Where
> > > > >> are the parsers and how does this all work?
> > > > >> 
> > > > >> I've posted a question (same subject) on the Tika list and Nick
> > > > >> tells me there must be parsers somewhere. Well, i have no idea how
> > > > >> we do it in Nutch, do you?
> > > >> 
> > > >> Thanks
> > > > 
> > > > --
> > > > *Lewis*
> > 
> > --
> > Markus Jelsma - CTO - Openindex

-- 
Markus Jelsma - CTO - Openindex

Re: tika-core, tika-parser

Posted by Julien Nioche <li...@gmail.com>.
Sorry, I don't understand what your issue is. We have a dependency on
tika-parsers, and the actual parser implementations (listed in the tika-parsers
POM) are pulled in transitively, just like any other dependency managed by Ivy.
They end up being copied into runtime/local/plugins/parse-tika/ or put in
the job in runtime/deploy/.
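
As a quick check that the transitive jars really ended up where described, a small sketch like the following lists the jar files in a plugin directory. The `PluginJars` class is hypothetical, and the default path is simply the runtime/local/plugins/parse-tika/ location mentioned above, taken relative to the Nutch build directory; pass a different path as the first argument if your layout differs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PluginJars {
    // Returns the names of all *.jar files directly inside pluginDir, sorted by name.
    static List<String> listJars(Path pluginDir) throws IOException {
        try (Stream<Path> entries = Files.list(pluginDir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Default location is an assumption based on the path quoted in this thread.
        Path dir = Paths.get(args.length > 0 ? args[0] : "runtime/local/plugins/parse-tika");
        listJars(dir).forEach(System.out::println);
    }
}
```

If tika-parsers shows up in that list without the libraries it depends on, the transitive resolution did not happen for that build.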


On 8 February 2012 13:03, Markus Jelsma <ma...@openindex.io> wrote:

> Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's
> something else.
>
> dependencies, dependencies, dependencies.... :(
>
> On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
> > The dependencies for the plugins are defined locally as shown in the URL
> > below, where you can see the ref to tika-parsers for parse-tika. Is that
> > more clear for you Markus?
> >
> > On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:
> > > Hi Markus,
> > >
> > > For starters
> > >
> > > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
> > >
> > > Can we pick our way through this?
> > >
> > > Thanks
> > >
> > >
> > > On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
> > > <markus.jelsma@openindex.io> wrote:
> > >> Hi,
> > >>
> > >> Can anyone shed light on this? We don't have any parsers in our libs
> > >> dir and we don't have tika-parsers jar, only the tika-core jar. Where
> > >> are the parsers and how does this all work?
> > >>
> > >> I've posted a question (same subject) on the Tika list and Nick tells
> > >> me there must be parsers somewhere. Well, i have no idea how we do it
> > >> in Nutch, do you?
> > >>
> > >> Thanks
> > >
> > > --
> > > *Lewis*
>
> --
> Markus Jelsma - CTO - Openindex
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: tika-core, tika-parser

Posted by Markus Jelsma <ma...@openindex.io>.
Yes, it looks like it! It should also be upgraded to Tika 1.0. But that's 
something else.

dependencies, dependencies, dependencies.... :(

On Wednesday 08 February 2012 14:04:26 Julien Nioche wrote:
> The dependencies for the plugins are defined locally as shown in the URL
> below, where you can see the ref to tika-parsers for parse-tika. Is that
> more clear for you Markus?
> 
> On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:
> > Hi Markus,
> > 
> > For starters
> > 
> > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
> > 
> > Can we pick our way through this?
> > 
> > Thanks
> > 
> > 
> > On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma
> > <markus.jelsma@openindex.io> wrote:
> >> Hi,
> >> 
> >> Can anyone shed light on this? We don't have any parsers in our libs dir
> >> and we don't have tika-parsers jar, only the tika-core jar. Where are the
> >> parsers and how does this all work?
> >> 
> >> I've posted a question (same subject) on the Tika list and Nick tells me
> >> there must be parsers somewhere. Well, i have no idea how we do it in
> >> Nutch, do you?
> >> 
> >> Thanks
> > 
> > --
> > *Lewis*

-- 
Markus Jelsma - CTO - Openindex

Re: tika-core, tika-parser

Posted by Julien Nioche <li...@gmail.com>.
The dependencies for the plugins are defined locally, as shown in the URL
below, where you can see the ref to tika-parsers for parse-tika. Is that
clearer for you, Markus?

On 8 February 2012 12:58, Lewis John Mcgibbney <le...@gmail.com> wrote:

> Hi Markus,
>
> For starters
>
>
> http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup
>
> Can we pick our way through this?
>
> Thanks
>
>
> On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma <markus.jelsma@openindex.io>
> wrote:
>
>> Hi,
>>
>> Can anyone shed light on this? We don't have any parsers in our libs dir
>> and we don't have tika-parsers jar, only the tika-core jar. Where are the
>> parsers and how does this all work?
>>
>> I've posted a question (same subject) on the Tika list and Nick tells me
>> there must be parsers somewhere. Well, i have no idea how we do it in
>> Nutch, do you?
>>
>> Thanks
>>
>
>
>
> --
> *Lewis*
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: tika-core, tika-parser

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Markus,

For starters

http://svn.apache.org/viewvc/nutch/trunk/src/plugin/parse-tika/ivy.xml?view=markup

Can we pick our way through this?

Thanks

On Wed, Feb 8, 2012 at 12:50 PM, Markus Jelsma <ma...@openindex.io> wrote:

> Hi,
>
> Can anyone shed light on this? We don't have any parsers in our libs dir
> and we don't have tika-parsers jar, only the tika-core jar. Where are the
> parsers and how does this all work?
>
> I've posted a question (same subject) on the Tika list and Nick tells me
> there must be parsers somewhere. Well, i have no idea how we do it in
> Nutch, do you?
>
> Thanks
>



-- 
*Lewis*