
Posted to dev@nutch.apache.org by Jukka Zitting <ju...@gmail.com> on 2006/08/16 13:06:13 UTC

Tika update

Hi,

There was recently discussion on perhaps starting a new Lucene
sub-project, named Tika, to create a general-purpose library from the
parser components and other features in Nutch that might interest a
wider audience. To keep things rolling we've created a temporary
staging area for the project at http://code.google.com/p/tika/ on
Google Code, and I've started to flesh out a potential project
structure using Maven 2.

Note that the project materials in svn refer to the project as "Apache
Tika" even though the project has *not* been officially accepted. The
reason for this is that the Google Code project is just a temporary
staging ground and I wanted to give a better idea of what the project
could look like if accepted. The jury is still out on whether to start
a project like this, so any comments and feedback on the idea are very
much welcome.

Most, if not all, code in Tika will be based on existing code from
Nutch and other Apache projects, so I'm not sure if the project needs
to go through the Incubator if accepted by the Lucene PMC.

So far the tika source tree contains just a modified version of my
TextExtractor code from the Apache Jackrabbit project, and Jérôme is
planning to add some of his stuff. The source tree at Google Code
should be considered just a playground for bringing things together
and discussing ideas, before migrating back to ASF infrastructure.

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

Re: Tika update

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 8/16/06, Sami Siren <ss...@gmail.com> wrote:
> IMO to solve the main problem one does not need to set up another
> project, just refactor and repackage.

I'd be happy either way, as long as I get a nice reusable library to
use in Jackrabbit. :-)

I think the key question on whether to branch off a new project or just
add a separate build target in svn is the expected community around the
identified codebase. For example I'd be very much interested in
working on the general-purpose code identified for Tika, but at least
for now I have little use for Nutch or interest in participating in
general Nutch development (not saying that Nutch isn't good, just that
I don't have the itch that Nutch is scratching). If there are enough
people like me, then I think it makes sense to start another project,
but otherwise I'd be happy to hang around here as well.

BR,

Jukka Zitting

-- 
Yukatan - http://yukatan.fi/ - info@yukatan.fi
Software craftsmanship, JCR consulting, and Java development

Re: Tika update

Posted by Sami Siren <ss...@gmail.com>.

Chris Mattmann wrote:
> However, the current Nutch software contains many "value-added" pieces of
> code that are monolithically packaged together. If the services and
> capabilities from the code were provided as separate, modular component
> libraries, such services and capabilities could benefit many projects,
> besides just Nutch. Ideally, one would not want to include the entire Nutch
> jar file to take advantage of its content parsing tools. Additionally, with
> the formulation of Hadoop, there is precedent for breaking Nutch down into
> different component libraries. So far, the "value-added" pieces of code
> monolithically packaged that we have identified are of three kinds:

IMO to solve the main problem one does not need to set up another 
project, just refactor and repackage.

--
  Sami Siren

Re: Tika update

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hmmm I guess the nutch-dev list doesn't like MS Word attachments. Here's the
content of the proposal, pasted in plain text:

<snip>
Proposal for new Lucene Sub Project called "Tika"

Chris A. Mattmann, Jerome Charron

Overview

With its simple but efficient plugin system, Nutch is becoming more and more
of a search engine framework that can easily be tuned to many kinds of
domain-specific search applications (e.g., corporate, personal, internet,
vertical search, general search). Nutch is a standalone "library",
containing search engine tools such as a crawler, and tools for index
management. Nutch is also very much a component in its own right, exporting
its own API, and having the ability to be used as a plugin component in
other systems. 

However, the current Nutch software contains many "value-added" pieces of
code that are monolithically packaged together. If the services and
capabilities from the code were provided as separate, modular component
libraries, such services and capabilities could benefit many projects,
besides just Nutch. Ideally, one would not want to include the entire Nutch
jar file to take advantage of its content parsing tools. Additionally, with
the formulation of Hadoop, there is precedent for breaking Nutch down into
different component libraries. So far, the "value-added" pieces of code
monolithically packaged that we have identified are of three kinds:

* Infrastructure: the Nutch plugin system. As a standalone library, this
plugin system can be reused in Lucene, Nutch, Solr and many other projects
to easily provide extensibility.
* Content analysis: the MimeType repository, the language identifier, the
summarizers, the signature implementations. These pieces of code could be
useful in any content-related project.
* Content parsing: all of Nutch's parse plugins. These plugins are
generally wrappers around external APIs. Their added value is a common API
for accessing many types of content. Again, this could be very useful in
many content-based projects (Lucene-based projects, Solr, ...).
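
To make the "common API" idea in the content parsing item concrete, here is
a minimal sketch in Java. Every name in it (ContentParser, PlainTextParser,
ParserRegistry) is hypothetical and invented for illustration only; none of
them comes from Nutch or from this proposal. It assumes each format-specific
wrapper implements one shared interface and that callers pick a wrapper by
MIME type.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: these names are assumptions, not Nutch or Tika APIs.
public class CommonParserSketch {

    /** The one interface that every format-specific wrapper implements. */
    interface ContentParser {
        String extractText(InputStream in) throws IOException;
    }

    /** Trivial wrapper for plain text; a real wrapper would delegate to an external library. */
    static class PlainTextParser implements ContentParser {
        @Override
        public String extractText(InputStream in) throws IOException {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    /** Selects a parser by MIME type so callers never see format-specific APIs. */
    static class ParserRegistry {
        private final Map<String, ContentParser> parsers = new HashMap<>();

        void register(String mimeType, ContentParser parser) {
            parsers.put(mimeType, parser);
        }

        String parse(String mimeType, InputStream in) throws IOException {
            ContentParser parser = parsers.get(mimeType);
            if (parser == null) {
                throw new IllegalArgumentException("No parser registered for " + mimeType);
            }
            return parser.extractText(in);
        }
    }

    public static void main(String[] args) throws IOException {
        ParserRegistry registry = new ParserRegistry();
        registry.register("text/plain", new PlainTextParser());
        InputStream in = new ByteArrayInputStream("Hello, Tika".getBytes(StandardCharsets.UTF_8));
        System.out.println(registry.parse("text/plain", in));
    }
}
```

With this shape, supporting another format would mean registering one more
wrapper under its MIME type, while calling code stays unchanged.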

It is our proposal that these identified pieces of code be extracted and
maintained in a separate library that we would dub "Tika". Tika would become
a sub-project of Lucene, much as Hadoop did when it recently graduated out
of Nutch. Tika would be a framework and API for content analysis and
parsing in large-scale distributed systems. It would also include Nutch's
useful plugin system, which could be easily reused across many projects,
both within and outside of Lucene.

Benefits of Extracting the Aforementioned Value Added Code Fragments into
their Own Library

* Avoid duplicating code over Lucene's subprojects.
* Better visibility of these pieces of code.
* Wider usage of these pieces of code.
* Together, the two previous points should lead to better extension and
maintenance of these pieces of code.

RoadMap

* tika-0.1 : Simply gathers the Nutch code that is easiest to externalize
(MimeType, LanguageIdentifier, Summarizer, Signature)
* tika-0.2 : Provides a generic plugin mechanism
* tika-0.3 : Provides content parsing / analysis plugins
  * Operational Tika library
  * Nutch external dependency on Tika
* tika-0.4 and beyond: issues identified by community, more content parsing
plugins, graphical user interfaces, command line tools, and more

There has been some recent interest on the Nutch mailing lists in a generic
framework for content analysis and metadata management. From that interest,
we have gathered the following list of candidate committers who have
expressed interest in our proposed project. The leader of the
Tika project would be Chris Mattmann. Chris works at NASA's Jet Propulsion
Laboratory as a Member of the Technical Staff in the Modeling and Data
Management Systems Section. Chris has contributed many patches to Nutch, and
a single patch to the Hadoop project as well. In addition to his work at
JPL, Chris is also a Ph.D. candidate at the University of Southern
California's Center for Software Engineering, where he works with his
advisor Dr. Nenad Medvidovic researching software architecture for
data-intensive systems. Chris's dissertation research investigates software
connectors and their properties in large-scale, distributed, data-intensive
systems. His expected date of defense is May 2007. The other "core" member
of the commit team would be Jerome Charron, one of Nutch's existing
committers. Jerome has contributed many useful patches to the Nutch system,
including the metadata analysis container and the mime type identification
system. Though Chris would be the lead of the project, the oversight and
vision for the project would be shared between Jerome and Chris. The full
list of candidate committers is as follows. This list is not meant to be
exhaustive, and is based entirely on the interest that we have gleaned from
mailing list conversations.

Candidate Committers

* Jérôme Charron 
* Chris Mattmann 
* Rida Benjelloun (Lius)
* Otis Gospodnetić (Simpy)
* Dawid Weiss (Carrot2)
* Jukka Zitting 
* Michael Wechner 

</snip>

The MS-word version can be found at the following link:

http://www-scf.usc.edu/~mattmann/Tika.doc

Thanks,
  Chris


Re: Tika update

Posted by Chris Mattmann <ch...@jpl.nasa.gov>.

Hi Jukka,

Thanks for your email. Indeed, there was discussion on the Lucene PMC email
list about the Tika project. It was decided by the powers that be to
discuss it more on the Nutch mailing list before moving forward with any
vote on making Tika a sub-project of Apache Lucene. With regard to that, my
action was to send the Tika proposal to the nutch-dev list, and help to
start up a discussion on Tika, to get feedback from the community. Seeing as
you lit the fire under this (thanks!), it's only appropriate for
me to send out the Tika project proposal sent to the Lucene PMC. So, here it
is, attached. I'd love to hear feedback from the Nutch community on what it
thinks of such a project.

Cheers,
   Chris

On 8/16/06 4:06 AM, "Jukka Zitting" <ju...@gmail.com> wrote:

> Hi,
>
> There was recently discussion on perhaps starting a new Lucene
> sub-project, named Tika, to create a general-purpose library from the
> parser components and other features in Nutch that might interest a
> wider audience. To keep things rolling we've created a temporary
> staging area for the project at http://code.google.com/p/tika/ on
> Google Code, and I've started to flesh out a potential project
> structure using Maven 2.
>
> Note that the project materials in svn refer to the project as "Apache
> Tika" even though the project has *not* been officially accepted. The
> reason for this is that the Google Code project is just a temporary
> staging ground and I wanted to give a better idea of what the project
> could look like if accepted. The jury is still out on whether to start
> a project like this, so any comments and feedback on the idea are very
> much welcome.
>
> Most, if not all, code in Tika will be based on existing code from
> Nutch and other Apache projects, so I'm not sure if the project needs
> to go through the Incubator if accepted by the Lucene PMC.
>
> So far the tika source tree contains just a modified version of my
> TextExtractor code from the Apache Jackrabbit project, and Jérôme is
> planning to add some of his stuff. The source tree at Google Code
> should be considered just a playground for bringing things together
> and discussing ideas, before migrating back to ASF infrastructure.
>
> BR,
>
> Jukka Zitting
>
> --
> Yukatan - http://yukatan.fi/ - info@yukatan.fi
> Software craftsmanship, JCR consulting, and Java development