Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2009/03/10 10:57:36 UTC

Moving Nutch parsers to Tika

Hi all,

I've been thinking for a while about what Sami suggested in another
thread: "I think we should start looking at Apache Tika for most (or
all) of our parsers."

This is actually part of my broader vision for Nutch: the project
should not duplicate the functionality of other well-established
projects by re-implementing it, only poorly. Our focus is not on
parsers, plugins, mime/charset detection, or distributed RPC, but on
building a robust platform for crawling.

We could start by donating to Tika those Nutch parsers that are not
already present there, and by using Tika's parsers in Nutch wherever
that is already possible. Once Tika supports all the formats that our
parsers handle, we should switch completely to Tika.

Of course, this will happen post-1.0 release.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Moving Nutch parsers to Tika

Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Hi all,
> 
> I've been thinking for a while about what Sami suggested in another
> thread: "I think we should start looking at Apache Tika for most (or
> all) of our parsers."
> 
> This is actually part of my broader vision for Nutch: the project
> should not duplicate the functionality of other well-established
> projects by re-implementing it, only poorly. Our focus is not on
> parsers, plugins, mime/charset detection, or distributed RPC, but on
> building a robust platform for crawling.

I share that same vision.

> 
> We could start by donating to Tika those Nutch parsers that are not
> already present there, and by using Tika's parsers in Nutch wherever
> that is already possible. Once Tika supports all the formats that our
> parsers handle, we should switch completely to Tika.

I think the only parser completely missing from Tika is the SWF one
(https://issues.apache.org/jira/browse/TIKA-147). Tika also supports
some formats that Nutch currently does not, and it provides more
advanced parsing for some of the formats both handle.

--
  Sami Siren

Re: Moving Nutch parsers to Tika

Posted by Otis Gospodnetic <og...@yahoo.com>.
I absolutely agree.  Duplicating work and focusing on non-core functionality, when the same functionality is available from Tika, is not wise for Nutch.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: Andrzej Bialecki <ab...@getopt.org>
> To: nutch-dev@lucene.apache.org
> Sent: Tuesday, March 10, 2009 5:57:36 AM
> Subject: Moving Nutch parsers to Tika