Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2009/03/10 10:57:36 UTC
Moving Nutch parsers to Tika
Hi all,
I've also been thinking for a while about what Sami suggested in another
thread: "I think we should start looking at Apache Tika for most (or
all) of our parsers."
This is part of my broader vision for Nutch: the project should not
re-implement, only poorly, functionality that other well-established
projects already provide, because our focus is not on parsers, plugins,
MIME/charset detection, or distributed RPC, but on building a robust
platform for crawling.
We could start by donating to Tika those Nutch parsers that are not
already present there, and by using Tika's parsers in Nutch wherever
that is already possible. Once Tika supports every format that our
parsers handle, we should switch completely to Tika.
Of course, this will happen after the 1.0 release.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Moving Nutch parsers to Tika
Posted by Sami Siren <ss...@gmail.com>.
Andrzej Bialecki wrote:
> Hi all,
>
> I've also been thinking for a while about what Sami suggested in another
> thread: "I think we should start looking at Apache Tika for most (or
> all) of our parsers."
>
> This is part of my broader vision for Nutch: the project should not
> re-implement, only poorly, functionality that other well-established
> projects already provide, because our focus is not on parsers, plugins,
> MIME/charset detection, or distributed RPC, but on building a robust
> platform for crawling.
I share that same vision.
>
> We could start by donating to Tika those Nutch parsers that are not
> already present there, and by using Tika's parsers in Nutch wherever
> that is already possible. Once Tika supports every format that our
> parsers handle, we should switch completely to Tika.
I think the only parser that is entirely missing from Tika is SWF
(https://issues.apache.org/jira/browse/TIKA-147). Tika also supports
some formats that Nutch currently does not (in addition to providing
more advanced parsing on some formats).
--
Sami Siren
Re: Moving Nutch parsers to Tika
Posted by Otis Gospodnetic <og...@yahoo.com>.
I absolutely agree. Duplicating work and focusing on non-core functionality that Tika already provides is not wise for Nutch.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch