Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/12/17 14:17:30 UTC

Fwd: Proposal: Commons SAX

Hi,

I think the SAX classes that we've come up with in o.a.tika.sax would
also be useful to other projects that don't otherwise depend on Tika, so
I've contacted Apache Commons about the possibility of starting a
"Commons SAX" component to make the code available to a wider
audience. See below for the proposal.

BR,

Jukka Zitting



---------- Forwarded message ----------
From: Jukka Zitting <ju...@gmail.com>
Date: Wed, Dec 17, 2008 at 2:09 PM
Subject: Proposal: Commons SAX
To: Jakarta Commons Developers List <de...@commons.apache.org>


Hi,

In the Apache Tika project [1] we use SAX quite a lot, and have
written a set of quite useful general utility classes for SAX
handling.

For example, in org.apache.tika.sax [2] we have the following:

* ContentHandlerDecorator - Convenient base class for writing
ContentHandler decorators
* EmbeddedContentHandler - Decorator that blocks startDocument() and
endDocument() calls
* TeeContentHandler - Forwards SAX events to multiple handlers
* TextContentHandler - Decorator that blocks everything but character
events (and start/endDocument)
* WriteOutContentHandler - Writes the contents of all character events
to a Writer
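
To give a rough idea of the decorator style listed above, here is a
minimal sketch built only on the standard org.xml.sax API. It is not
the Tika code: the names SaxDecoratorSketch, Tee, WriteOut,
ElementCounter and demo are made up for illustration, and the tee
forwards only element and character events to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxDecoratorSketch {

    /** Hypothetical tee: forwards element and character events to every delegate. */
    static class Tee extends DefaultHandler {
        private final ContentHandler[] delegates;

        Tee(ContentHandler... delegates) {
            this.delegates = delegates;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) throws SAXException {
            for (ContentHandler h : delegates) {
                h.startElement(uri, localName, qName, atts);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName)
                throws SAXException {
            for (ContentHandler h : delegates) {
                h.endElement(uri, localName, qName);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            for (ContentHandler h : delegates) {
                h.characters(ch, start, length);
            }
        }
    }

    /** Hypothetical write-out handler: copies all character events to a Writer. */
    static class WriteOut extends DefaultHandler {
        private final Writer writer;

        WriteOut(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                // SAX callbacks can only throw SAXException, so wrap the I/O failure.
                throw new SAXException(e);
            }
        }
    }

    /** Hypothetical second consumer: counts start tags. */
    static class ElementCounter extends DefaultHandler {
        int count;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            count++;
        }
    }

    /** Parses the XML once and feeds both handlers through the tee. */
    static String demo(String xml) throws Exception {
        StringWriter text = new StringWriter();
        ElementCounter counter = new ElementCounter();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new Tee(new WriteOut(text), counter));
        return counter.count + ":" + text;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo("<a><b>Hello</b> <c>SAX</c></a>"));
    }
}
```

The point of the tee is that the input is parsed only once, while each
handler stays unaware of the others.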

In org.apache.tika.sax.xpath [3] we have a simple XPath subset
implementation that supports streaming and filtering of SAX events. In
other words, the implementation doesn't need a DOM tree to evaluate
XPath statements.
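
As an illustration of the streaming idea (and not of the actual
org.apache.tika.sax.xpath classes), the sketch below matches one
absolute element path such as /doc/title while the SAX events arrive,
and keeps only the text found below it. The names StreamingPathSketch,
PathTextCollector and extract are hypothetical, and real XPath syntax
is deliberately reduced to plain "/a/b" paths.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StreamingPathSketch {

    /** Hypothetical collector: gathers text under the element at the target path. */
    static class PathTextCollector extends DefaultHandler {
        private final String target;                       // e.g. "/doc/title"
        private final StringBuilder path = new StringBuilder();
        private final Deque<Integer> lengths = new ArrayDeque<>();
        private final StringBuilder text = new StringBuilder();
        private int depth = 0;
        private int matchLevel = -1;                       // depth of the match, -1 when outside

        PathTextCollector(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            // Remember the old path length so endElement can restore it.
            lengths.push(path.length());
            path.append('/').append(qName);
            depth++;
            if (matchLevel < 0 && path.toString().equals(target)) {
                matchLevel = depth;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth == matchLevel) {
                matchLevel = -1;                           // leaving the matched element
            }
            depth--;
            path.setLength(lengths.pop());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only the current path is held in memory; no tree is ever built.
            if (matchLevel >= 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Returns the text content found below the given absolute element path. */
    static String extract(String targetPath, String xml) throws Exception {
        PathTextCollector collector = new PathTextCollector(targetPath);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)), collector);
        return collector.text();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><title>Commons <b>SAX</b></title><body>ignored</body></doc>";
        System.out.println(extract("/doc/title", xml));
    }
}
```

Memory use here grows with the nesting depth of the document rather
than with its size, which is the main reason to evaluate paths on the
event stream instead of on a DOM tree.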

I believe this code would be useful also outside Tika, and I was
thinking that it might perhaps make sense to create a Commons project
for this. I also know of some SAX processing classes in Cocoon and
Jackrabbit that could well be of interest to a wider audience.

Do you think something like this would be interesting as a Commons
project? Are there other similar efforts that I should know of? I
looked at XML Commons in xml.apache.org, but it seems pretty dormant.

[1] http://lucene.apache.org/tika/
[2] http://lucene.apache.org/tika/apidocs/org/apache/tika/sax/package-summary.html
[3] http://lucene.apache.org/tika/apidocs/org/apache/tika/sax/xpath/package-summary.html

BR,

Jukka Zitting

RE: Proposal: Commons SAX

Posted by Uwe Schindler <us...@pangaea.de>.
Don't forget ElementMappingContentHandler, it's useful for others, too :)

Uwe

-----
UWE SCHINDLER
Webserver/Middleware Development
PANGAEA - Publishing Network for Geoscientific and Environmental Data
MARUM - University of Bremen
Room 2500, Leobener Str., D-28359 Bremen
Tel.: +49 421 218 65595
Fax:  +49 421 218 65505
http://www.pangaea.de/
E-mail: uschindler@pangaea.de
