Posted to dev@tika.apache.org by Devin Han <de...@apache.org> on 2011/10/24 10:54:57 UTC

Tika is waiting for ODFToolkit to improve ODF file format processing

I saw this issue in Tika: OpenOffice parser: master footer text isn't
extracted https://issues.apache.org/jira/browse/TIKA-736

The current ODF parser in Tika doesn't touch the styles part or embedded
documents, only meta and content. They are waiting for the first ODF Toolkit
incubating release, and will then switch to a full-featured parser, much as
they have for the POI-powered ones.

The first release is coming and we will have no code updates before it. So I
suggest we start discussing how to use ODF Toolkit to implement this, based
on the snapshot.

This feature concerns ODFDOM and the Simple ODF API. We have covered text
extraction in the cookbook and demo, see:

http://incubator.apache.org/odftoolkit/simple/document/cookbook/TextExtractor.html
http://incubator.apache.org/odftoolkit/simple/demo/demo2.html

The work we need to do:
(1) What are Tika's detailed requirements?
(2) Can the existing features of ODF Toolkit cover Tika's requirements?
(3) How do we use ODF Toolkit to implement it?

CC'ing the Tika dev list, in case people on that list are interested in this
issue.
-- 
-Devin

Re: Tika is waiting for ODFToolkit to improve ODF file format processing

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Tue, Oct 25, 2011 at 5:40 PM, Rob Weir <ro...@apache.org> wrote:

> Is there a list of the complete set of tags you use, or a schema or something?

Hmm, I think technically any tag that is valid XHTML is fair game,
but in practice the parsers seem to use a very limited set of tags
(table/td/tr, a, img, p, br, div, b, i, u, hN, ul/li, span).  I'm sure
there are more... and I'm not familiar with most of Tika's parsers!
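
To make the tag question concrete, here is a rough Java sketch of mapping a few ODF text elements onto tags from the list above in a single SAX pass. The mapping itself (text:p to p, text:h to hN via text:outline-level, text:list to ul, text:list-item to li) and the class name OdfToXhtmlSketch are assumptions made for illustration; this is not Tika's actual mapping, and it builds a string only for brevity.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class OdfToXhtmlSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Single pass over content.xml markup; returns a small XHTML-like fragment. */
    static String toXhtml(String contentXml) {
        StringBuilder out = new StringBuilder();
        // One entry per open ODF element; "" marks elements with no XHTML counterpart.
        Deque<String> open = new ArrayDeque<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(contentXml)),
                    new DefaultHandler() {
                private int textDepth; // > 0 while inside a mapped p or hN

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    String tag = TEXT_NS.equals(uri) ? mapTag(local, atts) : "";
                    open.push(tag);
                    if (!tag.isEmpty()) {
                        out.append('<').append(tag).append('>');
                        if (isTextBlock(tag)) textDepth++;
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (textDepth > 0) out.append(escape(new String(ch, start, length)));
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    String tag = open.pop();
                    if (!tag.isEmpty()) {
                        if (isTextBlock(tag)) textDepth--;
                        out.append("</").append(tag).append('>');
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse content", e);
        }
        return out.toString();
    }

    // Assumed mapping, chosen for illustration only.
    private static String mapTag(String local, Attributes atts) {
        switch (local) {
            case "p": return "p";
            case "h": return "h" + headingLevel(atts.getValue(TEXT_NS, "outline-level"));
            case "list": return "ul";
            case "list-item": return "li";
            default: return "";
        }
    }

    // Missing or malformed levels fall back to 1; values are clamped to h1..h6.
    private static int headingLevel(String raw) {
        try {
            return Math.max(1, Math.min(6, Integer.parseInt(raw)));
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    private static boolean isTextBlock(String tag) {
        return tag.equals("p") || tag.startsWith("h");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

Elements without a mapping (text:span, for instance) are dropped but their text is kept, so inline formatting would need further entries in mapTag.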

>> For TIKA-736 in particular, it'd be nice to "reconstruct" each slide
>> so that any text from the master slide/layout is inlined into each
>> slide that uses it, so that the resulting text looks the way it looks
>> when you view the document in OpenOffice. This is the approach we're
>> working towards in TIKA-712 for PPT/X files.
>
> Text box position is ultimately encoded as x,y coordinates on the
> slide. So the visual appearance on the slide and the order of the
> text boxes in the document's XML are generally unrelated. But it
> should be possible to sort the coordinates to get a top-to-bottom,
> left-to-right reading order. Maybe even with some sensitivity to
> BiDi.
>
> I've certainly seen that use case mentioned by others.

OK that makes sense.

Besides headers/footers shared across pages, and embedded docs,
are there other cases where ODF pulls in cross-referenced text?

On the position sorting, PDFBox works in a similar way, since PDF
also places text (well, glyphs!) at positions and then we have to
sometimes "reconstruct" how those glyphs might translate back into
words/lines.

>> I imagine to do this you'd need DOM-like access to the master slide /
>> layout / style, and could then use a SAX-like single pass for the
>> "normal" slides.
>>
>
> Well, you could stream one slide at a time, but we'd need to be able
> to store the complete text contents of each individual slide to do the
> coordinate sort. But that is not so bad. Presentations tend to be
> outrageously large based on large images (high color depth, high dpi)
> rather than large amounts of text.

That sounds great, as long as we have random-access to the set of
master slides so we can "slip-stream" in any headers/footers/etc.
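
A rough Java sketch of that split: master pages are read once into a map (standing in for random access), then slides are read in a single pass with the master text added to each. It assumes masters are style:master-page elements keyed by style:name and that each draw:page points at one through draw:master-page-name. The class and method names are hypothetical, and appending the master text after the slide's own text is an arbitrary choice.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class MasterSlideInlineSketch {
    static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static final String STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Pass 1: keep the text of every master page in memory, keyed by style:name. */
    static Map<String, String> readMasters(String stylesXml) {
        Map<String, String> masters = new HashMap<>();
        parse(stylesXml, new TextCollector(STYLE_NS, "master-page", STYLE_NS, "name") {
            @Override
            void onGroup(String key, String text) {
                masters.put(key, text);
            }
        });
        return masters;
    }

    /** Pass 2: read slides in order, appending the text of the master each one uses. */
    static List<String> readSlides(String contentXml, Map<String, String> masters) {
        List<String> slides = new ArrayList<>();
        parse(contentXml, new TextCollector(DRAW_NS, "page", DRAW_NS, "master-page-name") {
            @Override
            void onGroup(String key, String text) {
                slides.add(text + masters.getOrDefault(key, ""));
            }
        });
        return slides;
    }

    /** Collects paragraph text per grouping element and reports it when the group ends. */
    abstract static class TextCollector extends DefaultHandler {
        private final String groupNs, groupName, keyNs, keyName;
        private final StringBuilder text = new StringBuilder();
        private String key;
        private int paragraphDepth;

        TextCollector(String groupNs, String groupName, String keyNs, String keyName) {
            this.groupNs = groupNs;
            this.groupName = groupName;
            this.keyNs = keyNs;
            this.keyName = keyName;
        }

        abstract void onGroup(String key, String text);

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (groupNs.equals(uri) && groupName.equals(local)) {
                key = atts.getValue(keyNs, keyName);
                text.setLength(0); // only one group's text is buffered at a time
            } else if (TEXT_NS.equals(uri) && "p".equals(local)) {
                paragraphDepth++;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (paragraphDepth > 0) text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (TEXT_NS.equals(uri) && "p".equals(local)) {
                paragraphDepth--;
                text.append('\n');
            } else if (groupNs.equals(uri) && groupName.equals(local)) {
                onGroup(key, text.toString());
            }
        }
    }

    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

Real master pages may also carry layout placeholder text that would need filtering; the sketch does not attempt that.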

Thanks!

Mike McCandless

http://blog.mikemccandless.com

Re: Tika is waiting for ODFToolkit to improve ODF file format processing

Posted by Rob Weir <ro...@apache.org>.

On Tue, Oct 25, 2011 at 1:03 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> On Mon, Oct 24, 2011 at 9:17 AM, Rob Weir <ro...@apache.org> wrote:
>> On Mon, Oct 24, 2011 at 4:54 AM, Devin Han <de...@apache.org> wrote:
>>> I saw this issue in Tika: OpenOffice parser: master footer text isn't
>>> extracted https://issues.apache.org/jira/browse/TIKA-736
>>>
>>> The current ODF parser in Tika doesn't touch the styles part or embedded
>>> documents, only meta and content. They are waiting for the first ODF Toolkit
>>> incubating release, and will then switch to a full-featured parser, much as
>>> they have for the POI-powered ones.
>>>
>>> The first release is coming and we will have no code updates before it. So I
>>> suggest we start discussing how to use ODF Toolkit to implement this, based
>>> on the snapshot.
>>>
>>
>> In that JIRA thread Uwe talks about the desire for a
>> streaming/SAX-like API for scanning the ODF documents.  I agree.  The
>> DOM approach we use with ODF Toolkit is necessary for when you need
>> random, read/write access to a document.  But you pay a performance
>> (mainly heap memory) penalty for that flexibility.  But if you can
>> organize your program logic into a single-pass read-only approach,
>> then a streaming approach can -- in theory -- perform much better for
>> that restricted use case.  But I still wonder how much the underlying
>> ZipInputStream implementation actually manages to stream the deflate
>> algorithm when it unzips ODF's ZIP package....
>>
>> In any case, this is something I'd be interested in working on after
>> we get our initial ODF Toolkit release out.  A memory optimized
>> streaming API for read-only, single pass uses.
>
> I agree a more SAX-like (single pass, don't hold stuff in RAM)
> approach would mostly fit Tika's needs well.
>
> Note that the DOM approach is also used by other parsers Tika wraps
> (eg PDFBox, POI I think), so this is not a "unique" challenge for
> ODF.
>
> Tika's needs are actually quite simple compared to what ODFToolkit can
> do.
>
> Ie, really we just need read-only single pass (document -> text), with
> some amount of document structure retained (so we know where to put
> <p>, <div>, <b>, etc., tags).
>

Is there a list of the complete set of tags you use, or a schema or something?

> For TIKA-736 in particular, it'd be nice to "reconstruct" each slide
> so that any text from the master slide/layout is inlined into each
> slide that uses it, so that the resulting text looks the way it looks
> when you view the document in OpenOffice.  This is the approach we're
> working towards in TIKA-712 for PPT/X files.
>

Text box position is ultimately encoded as x,y coordinates on the
slide.  So the visual appearance on the slide and the order of the
text boxes in the document's XML are generally unrelated.  But it
should be possible to sort the coordinates to get a top-to-bottom,
left-to-right reading order.  Maybe even with some sensitivity to
BiDi.

I've certainly seen that use case mentioned by others.
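
A minimal Java sketch of that coordinate sort. It assumes each text box's position has already been converted to plain numbers in a single unit (ODF stores svg:x and svg:y as lengths such as "2.5cm"), and it treats BiDi as nothing more than a flag that flips the horizontal order. TextBox and readingOrder are hypothetical names, not ODF Toolkit API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SlideReadingOrderSketch {
    /** Hypothetical holder for one text box: top-left corner plus its text. */
    static final class TextBox {
        final double x;
        final double y;
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the texts ordered top-to-bottom, then left-to-right (or right-to-left). */
    static List<String> readingOrder(List<TextBox> boxes, boolean rightToLeft) {
        Comparator<TextBox> byX = Comparator.comparingDouble(b -> b.x);
        if (rightToLeft) {
            byX = byX.reversed();
        }
        Comparator<TextBox> order =
                Comparator.<TextBox>comparingDouble(b -> b.y).thenComparing(byX);

        List<TextBox> sorted = new ArrayList<>(boxes); // leave the caller's list untouched
        sorted.sort(order);

        List<String> texts = new ArrayList<>();
        for (TextBox box : sorted) {
            texts.add(box.text);
        }
        return texts;
    }
}
```

Boxes only count as being on the same row when their y values are exactly equal; a real implementation would probably want a tolerance, which is left out here.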

> I imagine to do this you'd need DOM-like access to the master slide /
> layout / style, and could then use a SAX-like single pass for the
> "normal" slides.
>

Well, you could stream one slide at a time, but we'd need to be able
to store the complete text contents of each individual slide to do the
coordinate sort.  But that is not so bad.  Presentations tend to be
outrageously large based on large images (high color depth, high dpi)
rather than large amounts of text.

> TIKA-735 is another issue with the current ODF parser, whereby the text
> from embedded documents is always placed at the end of the text from
> the original document, rather than being inlined at the point where
> the embedding occurred.  Seems like a SAX like API would work fine
> here, ie, we should simply recurse into the embedded doc when we
> encounter it.
>

Right.

> Mike McCandless
>
> http://blog.mikemccandless.com
>

Re: Tika is waiting for ODFToolkit to improve ODF file format processing

Posted by Michael McCandless <lu...@mikemccandless.com>.

On Mon, Oct 24, 2011 at 9:17 AM, Rob Weir <ro...@apache.org> wrote:
> On Mon, Oct 24, 2011 at 4:54 AM, Devin Han <de...@apache.org> wrote:
>> I saw this issue in Tika: OpenOffice parser: master footer text isn't
>> extracted https://issues.apache.org/jira/browse/TIKA-736
>>
>> The current ODF parser in Tika doesn't touch the styles part or embedded
>> documents, only meta and content. They are waiting for the first ODF Toolkit
>> incubating release, and will then switch to a full-featured parser, much as
>> they have for the POI-powered ones.
>>
>> The first release is coming and we will have no code updates before it. So I
>> suggest we start discussing how to use ODF Toolkit to implement this, based
>> on the snapshot.
>>
>
> In that JIRA thread Uwe talks about the desire for a
> streaming/SAX-like API for scanning the ODF documents.  I agree.  The
> DOM approach we use with ODF Toolkit is necessary for when you need
> random, read/write access to a document.  But you pay a performance
> (mainly heap memory) penalty for that flexibility.  But if you can
> organize your program logic into a single-pass read-only approach,
> then a streaming approach can -- in theory -- perform much better for
> that restricted use case.  But I still wonder how much the underlying
> ZipInputStream implementation actually manages to stream the deflate
> algorithm when it unzips ODF's ZIP package....
>
> In any case, this is something I'd be interested in working on after
> we get our initial ODF Toolkit release out.  A memory optimized
> streaming API for read-only, single pass uses.

I agree a more SAX-like (single pass, don't hold stuff in RAM)
approach would mostly fit Tika's needs well.

Note that the DOM approach is also used by other parsers Tika wraps
(eg PDFBox, POI I think), so this is not a "unique" challenge for
ODF.

Tika's needs are actually quite simple compared to what ODFToolkit can
do.

Ie, really we just need read-only single pass (document -> text), with
some amount of document structure retained (so we know where to put
<p>, <div>, <b>, etc., tags).

For TIKA-736 in particular, it'd be nice to "reconstruct" each slide
so that any text from the master slide/layout is inlined into each
slide that uses it, so that the resulting text looks the way it looks
when you view the document in OpenOffice.  This is the approach we're
working towards in TIKA-712 for PPT/X files.

I imagine to do this you'd need DOM-like access to the master slide /
layout / style, and could then use a SAX-like single pass for the
"normal" slides.

TIKA-735 is another issue with the current ODF parser, whereby the text
from embedded documents is always placed at the end of the text from
the original document, rather than being inlined at the point where
the embedding occurred.  Seems like a SAX like API would work fine
here, ie, we should simply recurse into the embedded doc when we
encounter it.
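
The recursion could look roughly like the Java sketch below. It assumes an embedded document shows up as a draw:object element whose xlink:href names a sub-directory of the package holding its own content.xml. The package is modelled as a plain Map from entry name to XML text, which is a stand-in for illustration rather than any ODF Toolkit or Tika type, and the class name is hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class EmbeddedInlineSketch {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
    static final String DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Extracts text from content.xml, inlining embedded documents where they occur. */
    static String extract(Map<String, String> pkg) {
        StringBuilder out = new StringBuilder();
        parseInto(pkg, "content.xml", out);
        return out.toString();
    }

    private static void parseInto(Map<String, String> pkg, String entry, StringBuilder out) {
        String xml = pkg.get(entry);
        if (xml == null) {
            return; // e.g. an embedded object that is not an ODF document
        }
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new DefaultHandler() {
                private int depth; // > 0 while inside text:p or text:h

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    if (TEXT_NS.equals(uri) && ("p".equals(local) || "h".equals(local))) {
                        depth++;
                    } else if (DRAW_NS.equals(uri) && "object".equals(local)) {
                        String href = atts.getValue(XLINK_NS, "href");
                        if (href != null) {
                            // Recurse right here, so the embedded text lands at this position.
                            String dir = href.startsWith("./") ? href.substring(2) : href;
                            parseInto(pkg, dir + "/content.xml", out);
                        }
                    }
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (depth > 0) out.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    if (TEXT_NS.equals(uri) && ("p".equals(local) || "h".equals(local))) {
                        depth--;
                        endLine(out);
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse " + entry, e);
        }
    }

    // Adds a line break unless the output is empty or already ends with one.
    private static void endLine(StringBuilder out) {
        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
            out.append('\n');
        }
    }
}
```

Each nested document gets its own parser instance, so only the output buffer is shared between levels. The sketch has no guard against a package whose entries reference each other in a cycle.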

Mike McCandless

http://blog.mikemccandless.com

Re: Tika is waiting for ODFToolkit to improve ODF file format processing

Posted by Rob Weir <ro...@apache.org>.

On Mon, Oct 24, 2011 at 4:54 AM, Devin Han <de...@apache.org> wrote:
> I saw this issue in Tika: OpenOffice parser: master footer text isn't
> extracted https://issues.apache.org/jira/browse/TIKA-736
>
> The current ODF parser in Tika doesn't touch the styles part or embedded
> documents, only meta and content. They are waiting for the first ODF Toolkit
> incubating release, and will then switch to a full-featured parser, much as
> they have for the POI-powered ones.
>
> The first release is coming and we will have no code updates before it. So I
> suggest we start discussing how to use ODF Toolkit to implement this, based
> on the snapshot.
>

In that JIRA thread Uwe talks about the desire for a
streaming/SAX-like API for scanning the ODF documents.  I agree.  The
DOM approach we use with ODF Toolkit is necessary for when you need
random, read/write access to a document.  But you pay a performance
(mainly heap memory) penalty for that flexibility.  But if you can
organize your program logic into a single-pass read-only approach,
then a streaming approach can -- in theory -- perform much better for
that restricted use case.  But I still wonder how much the underlying
ZipInputStream implementation actually manages to stream the deflate
algorithm when it unzips ODF's ZIP package....

In any case, this is something I'd be interested in working on after
we get our initial ODF Toolkit release out.  A memory optimized
streaming API for read-only, single pass uses.
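
As a rough illustration of the single-pass, read-only idea, the Java sketch below reads an ODF package front to back with ZipInputStream and feeds content.xml straight into a SAX parser, keeping only the extracted text in memory. It uses JDK classes only; the class name OdfStreamingTextSketch and the decision to treat just text:p and text:h as text containers are assumptions, and this is not an existing ODF Toolkit API. It also does not settle the question of how ZipInputStream buffers internally.

```java
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class OdfStreamingTextSketch {
    // Namespace of ODF text elements such as text:p and text:h.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Reads the package once and returns paragraph text, one line per paragraph. */
    static String extractText(InputStream odfPackage) {
        StringBuilder out = new StringBuilder();
        try (ZipInputStream zip = new ZipInputStream(odfPackage)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!"content.xml".equals(entry.getName())) {
                    continue; // skip mimetype, styles.xml, images, ...
                }
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                // The parser pulls bytes from the current ZIP entry as it needs them.
                factory.newSAXParser().parse(zip, new ParagraphHandler(out));
                break; // the parser may close the stream, so stop after content.xml
            }
        } catch (Exception e) {
            throw new IllegalStateException("could not read ODF package", e);
        }
        return out.toString();
    }

    private static final class ParagraphHandler extends DefaultHandler {
        private final StringBuilder out;
        private int depth; // > 0 while inside text:p or text:h

        ParagraphHandler(StringBuilder out) {
            this.out = out;
        }

        private static boolean isTextBlock(String uri, String local) {
            return TEXT_NS.equals(uri) && ("p".equals(local) || "h".equals(local));
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (isTextBlock(uri, local)) depth++;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) out.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (isTextBlock(uri, local) && --depth == 0) {
                out.append('\n'); // one line per outermost paragraph or heading
            }
        }
    }
}
```

A caller would pass any InputStream over an .odt, .ods or .odp file, for example `OdfStreamingTextSketch.extractText(new java.io.FileInputStream(path))`, where `path` is whatever file is at hand. Anything outside content.xml, such as styles.xml with its header and footer text, is skipped by this sketch.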

> This feature concerns ODFDOM and the Simple ODF API. We have covered text
> extraction in the cookbook and demo, see:
>
> http://incubator.apache.org/odftoolkit/simple/document/cookbook/TextExtractor.html
> http://incubator.apache.org/odftoolkit/simple/demo/demo2.html
>
> The work we need to do:
> (1) What are Tika's detailed requirements?
> (2) Can the existing features of ODF Toolkit cover Tika's requirements?
> (3) How do we use ODF Toolkit to implement it?
>
> CC'ing the Tika dev list, in case people on that list are interested in this
> issue.
> --
> -Devin
>