Posted to dev@tika.apache.org by Chris Mattmann <ch...@jpl.nasa.gov> on 2007/10/08 16:02:07 UTC

Monthly report draft

Hi Guys,

 Here is my draft of the report. Let me know if you guys concur, and I'll
add it to the wiki:

<report>
Tika is a toolkit for detecting and extracting metadata and structured
text content from various documents using existing parser
libraries. Tika entered incubation on March 22nd, 2007.

Community

There have been a number of positive developments within Tika during the
last few months. Traffic on the Tika mailing list has increased
significantly (typically 2 or 3 questions and 1 or 2 commits every day or
every other day), and there have been a lot of recent inquiries from
external projects wanting to collaborate with Tika (including Aperture,
PDFBox and a fellow developing a JSON library currently hosted at Google
Code). In addition, Tika's architecture has recently become a topic of
active discussion (as we'll see below).

We recently elected Keith Bennett as a new committer to Tika. Keith has been
spearheading many of the new patches committed to Tika, as well as
participating in discussions about the architecture and future direction of
the project.

Tika will be represented at the "Fast Feather" track at ApacheCon US by
Jukka Zitting. The rest of the community is helping to create the content
for the presentation. The abstract is listed below:

-----
Tika is a new content analysis framework born from the desire to factor out
commonality from the Apache Nutch search engine framework. Tika provides a
MIME detection framework, an extensible parsing framework and metadata
environment for content analysis. Though in its nascent stages, progress on
Tika has recently taken shape and the project is nearing a stable 0.1
release. 

In this talk, we'll describe the core APIs of Tika and discuss its use in
several distinct domains including search engines, scientific data
dissemination and an industrial setting.
-----

Development

There has been a flurry of JIRA issues and code activity [1], including 47
issues currently in JIRA (32 resolved issues, 14 closed issues, and 2
open major/minor issues in progress).

Tika's Parser interface (one of its key components) has just undergone a
major overhaul led by Jukka Zitting, and Chris Mattmann has recently
contributed a MimeType system (with help from fellow Apache Nutch committer
Jerome Charron) to Tika. We also cleaned up and refactored large parts of
the rest of the code (removing references to LuisLite and branding the
project wherever possible with the Tika name), in preparation for an
upcoming 0.1 release.

Chris Mattmann has led an effort to carve out the existing MimeType
detection system in Apache Nutch [2] and replace it with Tika's improved
MimeType detection system. There is a patch sitting in JIRA right now [3],
and barring objections, Nutch will rely on Tika for its MimeType detection
abilities.

Also active recently were committers Bertrand Delacretaz, Sami Siren and
Rida Benjelloun, committing patches and improvements wherever needed.

Issues before graduation

No changes since our last report: the Tika project is still at an
early stage of incubation. We need to continue bringing in the initial
codebases and are targeting an initial incubating release (0.1) probably
within the next month. We also need to work on growing the community and
figuring out how to best interact with external parser projects.


[1] http://issues.apache.org/jira/browse/TIKA
[2] http://lucene.apache.org/nutch/
[3] http://issues.apache.org/jira/browse/NUTCH-562
</report>

Let me know what you guys think. Thanks to Bertrand for his original report
which inspired mine ;)

Cheers,
  Chris


______________________________________________
Chris Mattmann, Ph.D.
Chris.Mattmann@jpl.nasa.gov
Cognizant Development Engineer
Early Detection Research Network Project

_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.



Re: Monthly report draft

Posted by Rida Benjelloun <ri...@doculibre.com>.
+1, Thanks Chris!
Rida.

On 10/8/07, Bertrand Delacretaz <bd...@apache.org> wrote:
>
> On 10/8/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:
>
> > ....Let me know what you guys think....
>
> +1 to the report, thanks Chris!
>
> -Bertrand
>



-- 
---------------------------------------------------------
Rida Benjelloun
Doculibre inc.
ridabenjelloun@apache.org
rida.benjelloun@doculibre.com
Cel: 418-262-3222
Tel: 418-353-3390
Site Web : http://www.doculibre.com
---------------------------------------------------------

Re: Monthly report draft

Posted by Bertrand Delacretaz <bd...@apache.org>.
On 10/8/07, Chris Mattmann <ch...@jpl.nasa.gov> wrote:

> ....Let me know what you guys think....

+1 to the report, thanks Chris!

-Bertrand