Posted to dev@tika.apache.org by Jukka Zitting <ju...@gmail.com> on 2008/11/06 01:27:53 UTC

The ODF toolkit

Hi,

Check out the new ODF toolkit project [1]. Especially the ODFDOM
library [2] seems like something we could use in Tika to better
extract stuff from OpenDocument files.

[1] http://odftoolkit.org/
[2] http://odftoolkit.org/projects/odftoolkit/pages/ODFDOM

BR,

Jukka Zitting

RE: The ODF toolkit

Posted by Uwe Schindler <uw...@thetaphi.de>.

Hello,

I am really interested in helping with TIKA development. I really like
TIKA's design based on SAX events!

> Hi,
> 
> Check out the new ODF toolkit project [1]. Especially the ODFDOM
> library [2] seems like something we could use in Tika to better
> extract stuff from OpenDocument files.
> 
> [1] http://odftoolkit.org/
> [2] http://odftoolkit.org/projects/odftoolkit/pages/ODFDOM
> 
> BR,
> 
> Jukka Zitting

I have seen this project, too. The problem with it is that it only provides
mappings of the object definitions to customized DOM objects, and that does
not really help you when importing the text.

TIKA's big advantage is the possibility to use SAX events when importing XML
formats. I am currently working on a patch for the ODF importer that maps
content.xml's tags to XHTML tags. This can be done very simply with a new SAX
filter: TagMappingContentHandler.
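
To illustrate the idea, here is a minimal sketch of such a tag-mapping
filter using only the JDK's SAX classes. It is not the actual patch: the
class name TagMappingSketch, the small mapping table and the convert()
helper are assumptions made up for this example, and a real importer would
also have to deal with attributes, heading levels and many more elements.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: forwards ODF text elements as XHTML elements.
class TagMappingSketch extends DefaultHandler {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    private final ContentHandler delegate;
    private final Map<String, String> mapping = new HashMap<>();

    TagMappingSketch(ContentHandler delegate) {
        this.delegate = delegate;
        // Illustrative mapping only; the heading level is ignored here.
        mapping.put("p", "p");
        mapping.put("h", "h1");
        mapping.put("list", "ul");
        mapping.put("list-item", "li");
        mapping.put("span", "span");
    }

    // Returns the XHTML name for an ODF text element, or null if unmapped.
    private String map(String uri, String localName) {
        return TEXT_NS.equals(uri) ? mapping.get(localName) : null;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String mapped = map(uri, localName);
        if (mapped != null) {
            // Attributes are dropped in this sketch.
            delegate.startElement(XHTML_NS, mapped, mapped, new AttributesImpl());
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        String mapped = map(uri, localName);
        if (mapped != null) {
            delegate.endElement(XHTML_NS, mapped, mapped);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // All text is forwarded, including text inside unmapped elements.
        delegate.characters(ch, start, length);
    }

    // Demo helper: runs the filter and records the resulting events as a string.
    static String convert(String xml) {
        final StringBuilder out = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                out.append('<').append(l).append('>');
            }
            @Override
            public void endElement(String u, String l, String q) {
                out.append("</").append(l).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to get URIs and local names
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new TagMappingSketch(recorder));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }
}
```

Because the filter only renames events as they stream past, no DOM tree is
ever built, which is the point of the SAX approach.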

I am preparing to post two patches to TIKA's issue management system that:

a) import ODF documents as structured XHTML items, as mentioned before.

b) provide a better conversion of XHTML SAX streams to plain text (better
than just reading characters() events). The problem here is the difference
between HTML block and span elements: just reading the element contents
creates whitespace issues...
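
The whitespace problem in b) can be sketched as follows. This is only an
illustration under assumptions, not the patch itself: the class name
BlockAwareTextSketch, the list of block elements and the toText() helper
are invented for this example. The handler inserts a line break around
block elements and nothing around span-like elements.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: XHTML SAX events to plain text, aware of block elements.
class BlockAwareTextSketch extends DefaultHandler {
    // Illustrative, incomplete list of XHTML block-level elements.
    private static final Set<String> BLOCKS = new HashSet<>(Arrays.asList(
            "p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
            "li", "ul", "ol", "table", "tr", "blockquote", "pre"));

    private final StringBuilder text = new StringBuilder();

    // Appends a line break unless the text is empty or already ends with one.
    private void breakLine() {
        int len = text.length();
        if (len > 0 && text.charAt(len - 1) != '\n') {
            text.append('\n');
        }
    }

    private static String name(String localName, String qName) {
        return localName.isEmpty() ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (BLOCKS.contains(name(localName, qName))) {
            breakLine();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (BLOCKS.contains(name(localName, qName))) {
            breakLine();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Span-like elements add nothing, so inline text stays joined.
        text.append(ch, start, length);
    }

    // Demo helper: parses an XHTML fragment and returns the extracted text.
    static String toText(String xhtml) {
        BlockAwareTextSketch handler = new BlockAwareTextSketch();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.text.toString();
    }
}
```

For the input <body><p>Hello <b>wor</b>ld</p><p>Second</p></body>, a
handler that only collects characters() events would run the two
paragraphs together as "Hello worldSecond", while this sketch keeps
"Hello world" and "Second" on separate lines.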

The same technique could be used for Open XML (Office 2007) documents. Using
the new classes of POI is a pain (the same problem: thousands of new objects
from a really big JAR file that just contains DOM, not SAX, mappings for Open
XML objects). A clean SAX solution would be preferable.

Just give me two more days to finish my patches!

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

Re: The ODF toolkit

Posted by Rida Benjelloun <ri...@doculibre.com>.

Hi Jukka,
Do we actually have any problems with the current implementation of the
OpenDocument format?
What kind of additional information should we extract?
Regards.


2008/11/5 Jukka Zitting <ju...@gmail.com>

> Hi,
>
> Check out the new ODF toolkit project [1]. Especially the ODFDOM
> library [2] seems like something we could use in Tika to better
> extract stuff from OpenDocument files.
>
> [1] http://odftoolkit.org/
> [2] http://odftoolkit.org/projects/odftoolkit/pages/ODFDOM
>
> BR,
>
> Jukka Zitting
>