Posted to dev@tika.apache.org by Christiaan Fluit <ch...@aduna-software.com> on 2007/10/05 14:31:49 UTC

Introducing the Aperture project

Hello Tika developers (cc Aperture list),

My name is Christiaan Fluit, I am one of the admins of the Aperture 
project. We have recently become aware of the existence of the Tika 
project, as several Tika developers/users brought it to our attention. 
It seems that you are trying to solve the same problems as we do. This 
mail is intended to give you an introduction to Aperture and the areas 
in which our projects overlap, so that you know we exist, what we do and 
how it relates to Tika. If you are interested, we can also explore 
various modes of cooperation.

Aperture is a Java framework for extracting and querying full-text 
content and metadata from various information systems (e.g. file 
systems, web sites, mail boxes) and the file formats (e.g. documents, 
images) occurring in these systems.

You could have a look at the homepage [1], sourceforge page [2] and Wiki 
[3].

The project started two years ago, when two organizations (DFKI, a 
German research institute [4] and Aduna, a Dutch software firm [5, 6]) 
recognized they had a common need for a data extraction framework. The 
core requirements were to crawl various data sources and to extract data 
from the objects that occur in these sources, using RDF [7] as a means 
to communicate and store information throughout this framework.

Since its inception the project benefited from contributions from 
developers affiliated with both founding partners as well as a group of 
external open source enthusiasts. It has been successfully embedded in 
various software projects - see [8] for details.

Aperture comes from the Semantic Web community. We firmly believe that 
storing data in RDF triples, making it conform to a well-defined model 
and making it searchable (Lucene) and structurally queryable (SPARQL), 
allows for very powerful applications. Many aspects of data integration 
that plague the users of relational databases or XML schemas become 
manageable or non-existent when using RDF technologies.

As far as we can see, the scope of Aperture is currently broader than 
that of Tika; perhaps you can comment on this. We provide seven kinds of 
services. Two of them have direct equivalents in Tika, namely Extractor 
(processes a stream to extract text and metadata) and MimeTypeIdentifier 
(determines the MIME type of a stream using heuristics such as magic 
numbers, strings, file extensions, etc.). The Aperture Extractors 
correspond directly to Tika Parsers, see e.g. [11] and [12]. As for MIME 
type identification, please compare [9] and [10]. The Documentation page 
on the Wiki [13] can also provide you with more details.
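
To make the MIME type identification idea above concrete, here is a minimal Java sketch of the general technique: check the leading bytes against a few well-known magic numbers and fall back to the file extension. The class and method names (MagicMimeSketch, identify, startsWith) are hypothetical and written for this illustration only; this is not the Aperture MimeTypeIdentifier or the Tika MIME code, both of which cover far more formats and heuristics.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative only: a tiny magic-number check with an extension fallback.
// All names here are hypothetical, not Aperture or Tika API.
class MagicMimeSketch {
    // Well-known file signatures (leading bytes).
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
    private static final byte[] ZIP = {'P', 'K', 3, 4};

    // True if data begins with exactly the bytes in magic.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // head: the first bytes of the stream; fileName: may be null.
    static String identify(byte[] head, String fileName) {
        // Magic numbers take precedence over the file name.
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";

        // Fallback heuristic: the file extension.
        if (fileName != null) {
            String lower = fileName.toLowerCase(Locale.ROOT);
            if (lower.endsWith(".txt")) return "text/plain";
            if (lower.endsWith(".html") || lower.endsWith(".htm")) return "text/html";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(identify(pdfHead, "report.bin")); // application/pdf
        System.out.println(identify(new byte[0], "notes.txt")); // text/plain
    }
}
```

In this sketch the content wins over the name, so a PDF stored as "report.bin" is still reported as application/pdf; a real identifier would need many more signatures and a way to resolve conflicts between heuristics.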

We believe that cooperation is better than competition. We could both 
benefit from our combined experience and ideas. We are looking forward 
to hearing your views on this.

This mail has been sent to both the tika-dev and aperture-devel lists so 
that both communities are kept informed of the progress of this discussion.


Kind regards,

Christiaan Fluit,
Leo Sauermann,
Antoni Mylka,

Admins of the Aperture Project.


[1] http://aperture.sourceforge.net
[2] http://sf.net/projects/aperture
[3] http://aperture.wiki.sourceforge.net
[4] http://www.dfki.de
[5] http://www.aduna-software.com
[6] http://www.openrdf.org
[7] http://www.w3.org/RDF/
[8] http://aperture.wiki.sourceforge.net/ProjectsUsingAperture
[9] http://aperture.svn.sourceforge.net/viewvc/aperture/trunk/aperture/src/java/org/semanticdesktop/aperture/mime/identifier/magic/
[10] http://svn.apache.org/viewvc/incubator/tika/trunk/src/main/java/org/apache/tika/mime/
[11] http://aperture.svn.sourceforge.net/viewvc/aperture/trunk/aperture/src/java/org/semanticdesktop/aperture/extractor/
[12] http://svn.apache.org/viewvc/incubator/tika/trunk/src/main/java/org/apache/tika/parser/
[13] http://aperture.wiki.sourceforge.net/Documentation


-- 
christiaan.fluit@aduna-software.com

Aduna
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands

+31 33 465 9987 phone
+31 33 465 9987 fax

http://www.aduna-software.com

Re: Introducing the Aperture project

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 10/5/07, Christiaan Fluit <ch...@aduna-software.com> wrote:
> We have recently become aware of the existence of the Tika
> project, as several Tika developers/users brought it to our attention.

Likewise, Aperture surfaced on our radar a while ago [1], and it
certainly looks interesting!

Unless you've already seen it, you should check out the Tika proposal
at [2] for more background on where we are coming from and what the
goals of the Tika project are.

> It seems that you are trying to solve the same problems as we do. This
> mail is intended to give you an introduction to Aperture and the areas
> in which our projects overlap, so that you know we exist, what we do and
> how it relates to Tika. If you are interested, we can also explore
> various modes of cooperation.

Thanks for the introduction! I agree that we have similar goals and
would very much like to see how and where we could work together. Tika is
still in an early stage of development, so I think we have lots of
options available, both technically and organizationally.

> As far as we can see, the scope of Aperture is currently broader than
> Tika, perhaps you can comment on this.

Yes. The scope of the Tika project is relatively tight on purpose, and
we're looking at ways to make the code as modular as possible to
support reuse in a wide variety of use cases. We want to make it easy
to use Tika for example with existing crawlers like in Apache Nutch
[3], with content repositories or databases like Apache Jackrabbit
[4], or with more advanced content analysis programs like Apache UIMA
[5].

> We believe that cooperation is better than competition. We could both
> benefit from our combined experience and ideas. We are looking forward
> to hear your view on this.

Agreed!

I see licensing as one major issue to be resolved for enabling better
cooperation. I see that your interfaces are licensed under AFL, which
seems to be in line with the Apache License, but your implementation
classes are under OSL, which makes it impossible for us to directly
use your code within an Apache project (see [6]).

One concrete thing that I think we could use as a starting point to
better understand the different design and licensing constraints would
be to implement an ApertureParser in Tika and a TikaExtractor in
Aperture. Such "cross-linking" would potentially allow Tika to use the
Aperture extractors and Aperture to use Tika parsers, and would
perhaps pave the way for more intimate integration in the future.
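
As a rough illustration of the cross-linking idea, the following Java sketch shows the adapter pattern in both directions. Everything in it is an assumption made for illustration: SimpleParser and SimpleExtractor are simplified stand-ins, not the actual Tika Parser or Aperture Extractor interfaces, and the "fullText" key merely imitates a result container (the real Aperture API communicates results as RDF).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: every type here is a hypothetical stand-in.
class CrossLinkSketch {
    // Parser-style contract (assumed): returns the text, fills in metadata.
    interface SimpleParser {
        String parse(InputStream in, Map<String, String> metadata) throws IOException;
    }

    // Extractor-style contract (assumed): writes text and metadata into one map.
    interface SimpleExtractor {
        void extract(InputStream in, String mimeType, Map<String, String> result)
                throws IOException;
    }

    static final String FULL_TEXT = "fullText"; // hypothetical result key

    // Counterpart of the proposed "ApertureParser": a parser backed by an extractor.
    static final class ExtractorBackedParser implements SimpleParser {
        private final SimpleExtractor extractor;
        private final String mimeType;

        ExtractorBackedParser(SimpleExtractor extractor, String mimeType) {
            this.extractor = extractor;
            this.mimeType = mimeType;
        }

        @Override
        public String parse(InputStream in, Map<String, String> metadata) throws IOException {
            Map<String, String> result = new HashMap<>();
            extractor.extract(in, mimeType, result);
            String text = result.remove(FULL_TEXT); // split text from metadata
            metadata.putAll(result);
            return text == null ? "" : text;
        }
    }

    // Counterpart of the proposed "TikaExtractor": an extractor backed by a parser.
    static final class ParserBackedExtractor implements SimpleExtractor {
        private final SimpleParser parser;

        ParserBackedExtractor(SimpleParser parser) {
            this.parser = parser;
        }

        @Override
        public void extract(InputStream in, String mimeType, Map<String, String> result)
                throws IOException {
            Map<String, String> metadata = new HashMap<>();
            String text = parser.parse(in, metadata);
            result.putAll(metadata);
            result.put(FULL_TEXT, text);
        }
    }

    // Toy extractor for the demo: treats the whole stream as UTF-8 text.
    static final SimpleExtractor PLAIN_TEXT = (in, mimeType, result) -> {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buffer.write(chunk, 0, n);
        }
        result.put(FULL_TEXT, new String(buffer.toByteArray(), StandardCharsets.UTF_8));
        result.put("mimeType", mimeType);
    };

    public static void main(String[] args) throws IOException {
        SimpleParser parser = new ExtractorBackedParser(PLAIN_TEXT, "text/plain");
        Map<String, String> metadata = new HashMap<>();
        InputStream in = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(parser.parse(in, metadata) + " " + metadata);
    }
}
```

The adapters contain no extraction logic of their own; they only translate between the two calling conventions, which is what would let each project keep its own implementations (and licenses) while still being usable from the other side.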

BR,

Jukka Zitting

[1] http://www.nabble.com/Aperture-tf4009924.html
[2] http://wiki.apache.org/incubator/TikaProposal
[3] http://lucene.apache.org/nutch/
[4] http://jackrabbit.apache.org/
[5] http://incubator.apache.org/uima/
[6] http://people.apache.org/~rubys/3party.html