Posted to dev@nutch.apache.org by BrunoWL <bw...@gmail.com> on 2009/11/10 19:28:24 UTC

Integration with Tika

Hi. I'm a beginner with Nutch. Can anybody tell me how to make Nutch use
parsers from Tika?
I searched everywhere I could think of and didn't find an answer.

thanks.



Re: Integration with Tika

Posted by Kirby Bohling <ki...@gmail.com>.
You'll need to be careful of classloader issues if you do that...

The core Nutch code needs only the MIME type support, but if you load
Tika from the lib directory rather than from the plugins/lib
directory, it won't be able to find any extensions.  I've used Tika to
implement a docx plugin, and ran into all of these problems.
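
To illustrate the kind of visibility problem described above, here is a
minimal, generic JVM sketch. It is not Nutch's or Tika's actual loading
code. It assumes that a plugin's jars are served by a child classloader
whose parent is the loader for the core lib directory; the class name
LoaderVisibilityDemo and the resource name are hypothetical. A temporary
directory stands in for a plugin's own lib directory, and the sketch
checks whether a resource placed there can be found through the parent
loader and through the child loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of Nutch or Tika.
class LoaderVisibilityDemo {

    // Hypothetical resource name standing in for a parser registration file.
    static final String RESOURCE = "nutch-thread-demo-parsers.txt";

    /** Returns { visibleFromParent, visibleFromChild }. */
    static boolean[] check() {
        try {
            // A temporary directory plays the role of a plugin's own lib directory.
            Path pluginDir = Files.createTempDirectory("plugin-lib");
            Path file = pluginDir.resolve(RESOURCE);
            Files.write(file, "example.DocxParser\n".getBytes(StandardCharsets.UTF_8));

            // The loader of this class plays the role of the core (lib) loader.
            ClassLoader parent = LoaderVisibilityDemo.class.getClassLoader();
            URL[] urls = { pluginDir.toUri().toURL() };
            try (URLClassLoader child = new URLClassLoader(urls, parent)) {
                // Lookups delegate from child to parent, never from parent to child.
                boolean fromParent = parent.getResource(RESOURCE) != null;
                boolean fromChild = child.getResource(RESOURCE) != null;
                return new boolean[] { fromParent, fromChild };
            } finally {
                Files.deleteIfExists(file);
                Files.deleteIfExists(pluginDir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = check();
        System.out.println("visible from parent loader: " + result[0]);
        System.out.println("visible from child loader:  " + result[1]);
    }
}
```

If the assumption about the loader hierarchy holds, code loaded by the
parent can only look things up through the parent, so anything that lives
only in a plugin's directory stays invisible to it.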

Kirby


On Thu, Nov 12, 2009 at 8:41 AM, Julien Nioche
<li...@gmail.com> wrote:
> Speaking of which, I'm planning to do some work on the Tika integration
> within the next week or so. Basically, I'll create a new plugin which will
> be used for the mime types that Tika can already handle while keeping some
> of the existing plugins for the more complex cases. This should allow us to
> already have a first version of the Tika integration without losing any
> functionality. I will update the list as soon as I have something working
> and will create a JIRA issue.
>
> J.
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/11/10 Andrzej Bialecki <ab...@getopt.org>
>>
>> BrunoWL wrote:
>>>
>>> Hi. I'm a beginner with Nutch. Can anybody tell me how to make Nutch use
>>> parsers from Tika?
>>> I searched everywhere I could think of and didn't find an answer.
>>
>> Tika parsers are not yet integrated with Nutch - we use our own parsers,
>> and in most cases they are of similar quality to those in Tika (since most
>> Tika parsers originated in Nutch). Tight Tika integration is on the roadmap.

Re: Integration with Tika

Posted by Julien Nioche <li...@gmail.com>.
Speaking of which, I'm planning to do some work on the Tika integration
within the next week or so. Basically, I'll create a new plugin which will
be used for the MIME types that Tika can already handle, while keeping some
of the existing plugins for the more complex cases. This should allow us to
have a first version of the Tika integration without losing any
functionality. I will update the list as soon as I have something working
and will create a JIRA issue.
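
As a rough sketch of the selection rule described above (not the actual
plugin, which does not exist yet): existing plugins keep the MIME types
registered for them, and a Tika-based plugin takes the other types that
Tika is known to handle. The class ParserChooser, its methods, and the
plugin names are all hypothetical and only illustrate the idea.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; class, method, and plugin names are illustrative only.
class ParserChooser {

    // Hypothetical name for the new Tika-based plugin.
    static final String TIKA_PLUGIN = "new-tika-plugin";

    private final Map<String, String> dedicated = new HashMap<>();
    private final Set<String> tikaTypes = new HashSet<>();

    /** Keeps an existing plugin for a MIME type (the "more complex cases"). */
    void keepDedicated(String mimeType, String pluginName) {
        dedicated.put(normalize(mimeType), pluginName);
    }

    /** Records a MIME type that the Tika-based plugin can handle. */
    void tikaHandles(String mimeType) {
        tikaTypes.add(normalize(mimeType));
    }

    /** Returns the plugin name to use, or null if nothing handles the type. */
    String choose(String mimeType) {
        String key = normalize(mimeType);
        String plugin = dedicated.get(key);
        if (plugin != null) {
            return plugin;              // an existing plugin takes precedence
        }
        return tikaTypes.contains(key) ? TIKA_PLUGIN : null;
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    private static String normalize(String mimeType) {
        int semi = mimeType.indexOf(';');
        String base = semi >= 0 ? mimeType.substring(0, semi) : mimeType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ParserChooser chooser = new ParserChooser();
        chooser.keepDedicated("text/html", "existing-html-plugin");
        chooser.tikaHandles("application/pdf");
        System.out.println(chooser.choose("text/html; charset=UTF-8"));
        System.out.println(chooser.choose("application/pdf"));
    }
}
```

Giving the existing plugins precedence is one way to introduce the new
plugin without changing how the already covered types are parsed.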

J.
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/11/10 Andrzej Bialecki <ab...@getopt.org>

> BrunoWL wrote:
>
>> Hi. I'm a beginner with Nutch. Can anybody tell me how to make Nutch use
>> parsers from Tika?
>> I searched everywhere I could think of and didn't find an answer.
>>
>
> Tika parsers are not yet integrated with Nutch - we use our own parsers,
> and in most cases they are of similar quality to those in Tika (since most
> Tika parsers originated in Nutch). Tight Tika integration is on the roadmap.

Re: Integration with Tika

Posted by Andrzej Bialecki <ab...@getopt.org>.
BrunoWL wrote:
> Hi. I'm a beginner with Nutch. Can anybody tell me how to make Nutch use
> parsers from Tika?
> I searched everywhere I could think of and didn't find an answer.

Tika parsers are not yet integrated with Nutch - we use our own parsers,
and in most cases they are of similar quality to those in Tika (since
most Tika parsers originated in Nutch). Tight Tika integration is on the
roadmap.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com