Posted to dev@tika.apache.org by Rida Benjelloun <ri...@doculibre.com> on 2008/01/13 02:28:39 UTC

Content extraction problem

Hello, everybody!

Happy New Year 2008.

I am currently testing content extraction, and I see that when I get
the content of the document via "writer.toString()", some words are stuck
together. This is a problem for indexing.

Example of Excel extraction via writer.toString(): Simple Excel
documentSample Excel Worksheet - Numbers and their Squares Number Square 1.0
1.0 2.0 4.0 3.0 9.0 4.0 16.0 5.0 25.0 6.0 36.0 7.0 49.0 8.0 64.0 9.0 81.0
10.0 100.0 11.0 121.0 12.0 144.0 13.0 169.0 14.0 196.0 15.0 225.0 Written
and saved in Microsoft Excel X for Mac Service Release 1.
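
The fused "documentSample" above is what one would expect if character data from adjacent elements is appended with no separator. A minimal sketch of one possible workaround follows, assuming the parser reports the document as XHTML-style SAX events. The class name SpacedTextHandler and the sample markup are hypothetical and are not part of Tika; only standard JDK SAX classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects character data and puts a space at each element boundary. */
class SpacedTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Element boundary: add a separator so adjacent cells or paragraphs do not fuse.
        text.append(' ');
    }

    @Override
    public String toString() {
        // Collapse the runs of whitespace produced by nested closing tags.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses the given markup (assumed well-formed) and returns the spaced text. */
    static String extract(String xhtml) throws Exception {
        SpacedTextHandler handler = new SpacedTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up markup standing in for whatever events the parser emits.
        String xhtml = "<html><body><p>Simple Excel document</p>"
                     + "<p>Sample Excel Worksheet</p></body></html>";
        // prints: Simple Excel document Sample Excel Worksheet
        System.out.println(extract(xhtml));
    }
}
```

Appending the separator at endElement rather than startElement keeps leading whitespace out of the result; either choice would stop the words from fusing.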

I also note that metadata is added to the content. In my
opinion it would be preferable for toString() on the writer to return
only the content of the document and not the metadata. The metadata is
already stored in the metadata object.

Regards.

Rida.

Re: Content extraction problem

Posted by Sami Siren <ss...@gmail.com>.

Rida Benjelloun wrote:

> I also note that metadata is added to the content. In my
> opinion it would be preferable for toString() on the writer to return
> only the content of the document and not the metadata. The metadata is
> already stored in the metadata object.

I agree, metadata (such as title) should not be part of content.
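
One way to keep the title out of the extracted text is sketched below. It assumes the parser emits XHTML-style SAX events with the title inside a head element and the document text inside a body element. The class name BodyOnlyTextHandler and the sample markup are hypothetical; this is an illustration using standard JDK SAX classes, not the Tika implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps only the text that appears inside the body element. */
class BodyOnlyTextHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Assumes a non-namespace-aware parser, so qName is the plain element name.
        if ("body".equals(qName)) {
            inBody = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;
        } else if (inBody) {
            text.append(' '); // separator between elements inside the body
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length); // head content such as the title is skipped
        }
    }

    @Override
    public String toString() {
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    /** Parses the given markup (assumed well-formed) and returns the body text only. */
    static String extract(String xhtml) throws Exception {
        BodyOnlyTextHandler handler = new BodyOnlyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Made-up markup: the title stands in for metadata, the paragraph for content.
        String xhtml = "<html><head><title>Simple Excel document</title></head>"
                     + "<body><p>Sample Excel Worksheet</p></body></html>";
        // prints: Sample Excel Worksheet
        System.out.println(extract(xhtml));
    }
}
```

With this split, the title would still be available from the metadata object, so nothing is lost by leaving it out of the text.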

--
  Sami Siren