
Posted to dev@tika.apache.org by Guillaume LOUVEL <lo...@yahoo.fr> on 2008/05/30 14:22:01 UTC

OpenOffice Document

Hello,

I am continuing my tests of Tika,
and I have a problem with OpenOffice .odt documents (Writer files).
If the document contains several lines of text,
the parse function merges all the lines into one string, with no separator
between the end of each line and the beginning of the next.

Example:
The original text:

This is a sample Open Office document, written in NeoOffice 2.2.1 for 
the Mac.


Sdfsdf


dfssdf


test text


And the result of the parse function with this call:
content = ParseUtils.getStringContent(
    new URL("file:" + instance.getPath()), TikaConfig.getDefaultConfig());

This is a sample Open Office document, written in NeoOffice 2.2.1 for 
the Mac.Sdfsdfdfssdftest text


Do you have a solution for this problem?

Guillaume LOUVEL

Re: OpenOffice Document

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Fri, May 30, 2008 at 3:22 PM, Guillaume LOUVEL <lo...@yahoo.fr> wrote:
> I am continuing my tests of Tika,
> and I have a problem with OpenOffice .odt documents (Writer files).
> If the document contains several lines of text,
> the parse function merges all the lines into one string, with no separator
> between the end of each line and the beginning of the next.

That's because our current OpenOffice parser simply strips all XML tags
from the content.xml file inside the .odt archive. Since there is no
whitespace between successive <text:p/> elements, the last word of a
paragraph gets concatenated with the first word of the next paragraph.
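
To illustrate the effect, here is a minimal sketch using the JDK's SAX
parser. It is not the actual Tika parser code: the class name
OdtParagraphDemo, the extract method, and the sample XML (a trimmed-down
stand-in for a real content.xml) are all assumptions made up for this
example. With breakAfterParagraph set to false it only collects character
data, which reproduces the concatenation; with true it appends a newline
whenever a text:p element ends.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class OdtParagraphDemo {

    // Namespace of the OpenDocument text elements such as <text:p>.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Hypothetical, heavily simplified stand-in for content.xml.
    static final String SAMPLE =
        "<office:text"
        + " xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
        + " xmlns:text=\"" + TEXT_NS + "\">"
        + "<text:p>Sdfsdf</text:p><text:p>dfssdf</text:p><text:p>test text</text:p>"
        + "</office:text>";

    // Collects all character data; optionally adds a newline after each paragraph.
    static String extract(String xml, final boolean breakAfterParagraph) throws Exception {
        final StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (breakAfterParagraph && TEXT_NS.equals(uri) && "p".equals(localName)) {
                    out.append('\n');
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(SAMPLE, false)); // tags dropped, paragraphs run together
        System.out.println(extract(SAMPLE, true));  // one paragraph per line
    }
}
```

The first call prints "Sdfsdfdfssdftest text" on a single line, which
matches the behaviour reported above; the second prints each paragraph on
its own line.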

Sooner or later we need to come up with a better OpenOffice parser,
perhaps an XSL transformation that explicitly converts content.xml to
simple XHTML. Or better yet, perhaps there already exists a library
for doing that...
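
As a rough sketch of what the XSL approach might look like, the following
uses the JDK's javax.xml.transform API with a tiny inline stylesheet that
maps each text:p element to a <p> element. The class name
OdtToXhtmlSketch, the toXhtml method, and the stylesheet itself are
assumptions for illustration only; a real transformation would also need
to handle headings, lists, tables, and spans, and would declare the XHTML
namespace on its output.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class OdtToXhtmlSketch {

    // Hypothetical minimal stylesheet: wraps the document in html/body and
    // turns every <text:p> into <p>. Text nodes are copied by the built-in rules.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'"
        + " exclude-result-prefixes='text'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<html><body><xsl:apply-templates/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match='text:p'>"
        + "<p><xsl:apply-templates/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given content.xml text and returns the markup.
    static String toXhtml(String contentXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(contentXml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Simplified stand-in for a real content.xml.
        String sample =
            "<office:text"
            + " xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
            + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'>"
            + "<text:p>Sdfsdf</text:p><text:p>dfssdf</text:p><text:p>test text</text:p>"
            + "</office:text>";
        System.out.println(toXhtml(sample));
    }
}
```

Because paragraph boundaries survive as explicit <p> elements, a
downstream text extractor can insert whitespace between them instead of
gluing the words together.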

Can you file an improvement request for this?

BR,

Jukka Zitting