Posted to java-user@lucene.apache.org by Christiaan Fluit <ch...@aduna.biz> on 2006/03/09 12:57:17 UTC

Aperture 2006.1 alpha 2 released

A little while ago I announced the existence of the Aperture project, 
founded by my company together with the DFKI institute.

We just released Aperture 2006.1 alpha 2, which may be of interest to 
all Lucene users dealing with crawling and text extraction.

The project page is located at:

	http://sourceforge.net/projects/aperture

To summarize, Aperture now has code for the following tasks:

- Crawling of file systems, websites and IMAP folders. An Outlook 
mailbox crawler is also in the works; any help is welcome.

- Text and metadata extraction of a large and growing number of document 
formats, e.g. MS Office files, MS Works, OpenOffice, OpenDocument, RTF, 
PDF, WordPerfect, Quattro, Presentations, HTML, XML, plain text...

- A robust magic number-based MIME type identifier, a must for choosing 
the right extractor for a given document.

- Security-related classes for handling self-signed certificates when 
communicating using SSL.
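
To illustrate what magic number-based identification means, here is a 
minimal sketch in plain Java. It is not Aperture's actual API: the class 
name MagicSniffer, the detect method and the small signature table are 
made up for this example, and "application/x-ole-storage" is only a 
placeholder label for the OLE2 container used by older MS Office files. 
The idea is simply to compare the first bytes of a document against 
known signatures rather than trusting the file extension.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative magic-number sniffer; NOT part of Aperture. */
public class MagicSniffer {

    /** One known signature: the leading bytes and the MIME type they suggest. */
    private static final class Signature {
        final byte[] magic;
        final String mimeType;

        Signature(byte[] magic, String mimeType) {
            this.magic = magic;
            this.mimeType = mimeType;
        }
    }

    // A deliberately tiny, hypothetical table; a real identifier needs many more entries.
    private static final Signature[] SIGNATURES = {
        new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"),
        new Signature("{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"),
        // ZIP local file header; OpenDocument files are ZIP containers as well.
        new Signature(new byte[] {'P', 'K', 3, 4}, "application/zip"),
        // OLE2 compound document header (older MS Office formats); placeholder type name.
        new Signature(new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                                  (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
                      "application/x-ole-storage"),
    };

    /** Returns the MIME type of the first matching signature, or a generic fallback. */
    public static String detect(byte[] head) {
        for (Signature s : SIGNATURES) {
            if (startsWith(head, s.magic)) {
                return s.mimeType;
            }
        }
        return "application/octet-stream";
    }

    /** True if data begins with exactly the bytes in prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of a stream and pass them to 
MagicSniffer.detect(...). Container formats show why a real identifier 
has to do more than this: an OpenDocument file matches the ZIP signature 
here, so telling it apart from an ordinary archive requires looking 
inside the container.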

Most of the code is already in good shape. The reason that it is still 
labeled as "alpha" is that we only recently started applying Aperture in 
our own software, which may still lead to certain (probably minor) API 
changes.

Future plans include extending the set of extractors (e.g. for mp3, 
images and videos), adding support for Thunderbird and other mail 
clients, and adding support for expanding and crawling archives, 
address books, ...

Furthermore we are working on metadata storage facilities that build 
upon Lucene and Sesame, an RDF storage and query engine (see 
www.openrdf.org). This should combine the expressiveness of RDF and the 
performance and scalability of Sesame with Lucene's full-text indexing 
capabilities.

For questions please consider joining the aperture-devel mailing list.


Regards,

Christiaan Fluit.

-- 
christiaan.fluit@aduna.biz

Aduna
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands

+31 33 465 9987 phone
+31 33 465 9987 fax

http://aduna.biz
