Posted to dev@tika.apache.org by Rida Benjelloun <ri...@doculibre.com> on 2008/01/13 02:28:39 UTC
Content extraction problem
Hello, everybody!
Happy New Year 2008.
I am currently testing content extraction, and I see that when I get the
content of the document via writer.toString(), some words are stuck
together. This is a problem for indexing.
Example of Excel extraction via writer.toString():
Simple Excel documentSample Excel Worksheet - Numbers and their Squares
Number Square 1.0 1.0 2.0 4.0 3.0 9.0 4.0 16.0 5.0 25.0 6.0 36.0 7.0 49.0
8.0 64.0 9.0 81.0 10.0 100.0 11.0 121.0 12.0 144.0 13.0 169.0 14.0 196.0
15.0 225.0 Written and saved in Microsoft Excel X for Mac Service Release 1.
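To make the symptom concrete, here is a minimal sketch of one way words can end up stuck together. It is only an assumption about the cause, not a diagnosis of Tika's code: if a SAX handler copies character events straight to a writer, text from adjacent elements (for example a title and the first paragraph) is joined with nothing in between. The class StuckWordsDemo, the handler TextHandler, and the sample XHTML are all hypothetical and use only the JDK, not the Tika API.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Tika.
class StuckWordsDemo {

    // Copies character events to a writer. When separateElements is true,
    // a single space is written after every closing element.
    static class TextHandler extends DefaultHandler {
        private final StringWriter writer = new StringWriter();
        private final boolean separateElements;

        TextHandler(boolean separateElements) {
            this.separateElements = separateElements;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            writer.write(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (separateElements) {
                writer.write(' ');
            }
        }

        @Override
        public String toString() {
            return writer.toString();
        }
    }

    // Parses the given XML and returns the text collected by the handler.
    static String extract(String xml, boolean separateElements) {
        TextHandler handler = new TextHandler(separateElements);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample input", e);
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        // Made-up stand-in for the XHTML events a parser might emit.
        String xml = "<html><head><title>Simple Excel document</title></head>"
                + "<body><p>Sample Excel Worksheet</p></body></html>";
        System.out.println(extract(xml, false)); // title and paragraph are fused
        System.out.println(extract(xml, true));  // whitespace keeps them apart
    }
}
```

With separateElements set to false, the title text and the paragraph text run together as "documentSample", which matches the output above. Writing whitespace at element boundaries is one possible workaround on the handler side; whether the fix belongs there or in the parser is an open question.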
I also note that metadata are added to the content. In my opinion it
would be preferable for toString() on the writer to return only the
content of the document and not the metadata. The metadata are already
stored in the metadata object.
Regards.
Rida.
Re: Content extraction problem
Posted by Sami Siren <ss...@gmail.com>.
Rida Benjelloun wrote:
> I also note that metadata are added to the content. In my opinion it
> would be preferable for toString() on the writer to return only the
> content of the document and not the metadata. The metadata are already
> stored in the metadata object.
I agree: metadata (such as the title) should not be part of the content.
--
Sami Siren