Posted to dev@tika.apache.org by Niall Pemberton <ni...@gmail.com> on 2007/11/20 08:14:36 UTC

How is WriteOutContentHandler supposed to work?

Apologies if this is a stupid question, but I don't understand
WriteOutContentHandler[1] - shouldn't it be implementing the
startElement(), endElement() etc. methods?

For example, ExcelParserTest[2] outputs the following for testEXCEL.xls:

Simple Excel documentSample Excel Worksheet - Numbers and their
Squares Number Square 1.0 1.0 2.0 4.0 3.0 9.0 4.0 16.0 5.0 25.0 6.0
36.0 7.0 49.0 8.0 64.0 9.0 81.0 10.0 100.0 11.0 121.0 12.0 144.0 13.0
169.0 14.0 196.0 15.0 225.0 Written and saved in Microsoft Excel X for
Mac Service Release 1.

...but I would have thought it should be something like:

<html>
<head>
<title>Simple Excel document</title>
</head>
<body>
<p>Sample Excel Worksheet - Numbers and their Squares Number Square
1.0 1.0 2.0 4.0 3.0 9.0 4.0 16.0 5.0 25.0 6.0 36.0 7.0 49.0 8.0 64.0
9.0 81.0 10.0 100.0 11.0 121.0 12.0 144.0 13.0 169.0 14.0 196.0 15.0
225.0 Written and saved in Microsoft Excel X for Mac Service Release
1.</p>
</body>
</html>

Niall

[1] http://incubator.apache.org/tika/xref/org/apache/tika/sax/WriteOutContentHandler.html
[2] http://incubator.apache.org/tika/xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html

Re: How is WriteOutContentHandler supposed to work?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Nov 20, 2007 3:14 PM, Niall Pemberton <ni...@gmail.com> wrote:
> OK thanks - is the document's title supposed to be written then? If it
> is, then why not the rest of the metadata?

Now that you've raised the issue, I think it was wrong of me to make
XHTMLContentHandler output the title as an <h1/> element within the
XHTML body. The title, as well as other document metadata, should go in
the XHTML head section.

> Also, there's no separation between the title and content start - which looks like a bug.

You're right, that's a bug.

BR,

Jukka Zitting

Re: How is WriteOutContentHandler supposed to work?

Posted by Niall Pemberton <ni...@gmail.com>.

On Nov 20, 2007 12:52 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Nov 20, 2007 9:14 AM, Niall Pemberton <ni...@gmail.com> wrote:
> > Apologies if this is a stupid question, but I don't understand
> > WriteOutContentHandler[1] - shouldn't it be implementing the
> > startElement(), endElement() etc. methods?
>
> There are a lot of use cases where a client is only interested in the
> plain text content of the document without any of the structuring
> encoded in the XHTML SAX events generated by a parser. The
> WriteOutContentHandler was designed to support those use cases as a
> simple and fast way to translate the SAX event stream to a character
> stream that only contains text from the parsed document.
>
> You can use a standard SAX TransformerHandler if you want to serialize
> the full generated XHTML document.

OK thanks - is the document's title supposed to be written then? If it
is, then why not the rest of the metadata? Also, there's no separation
between the title and content start - which looks like a bug.

Niall

> BR,
>
> Jukka Zitting

Re: How is WriteOutContentHandler supposed to work?

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Nov 20, 2007 9:14 AM, Niall Pemberton <ni...@gmail.com> wrote:
> Apologies if this is a stupid question, but I don't understand
> WriteOutContentHandler[1] - shouldn't it be implementing the
> startElement(), endElement() etc. methods?

There are a lot of use cases where a client is only interested in the
plain text content of the document without any of the structuring
encoded in the XHTML SAX events generated by a parser. The
WriteOutContentHandler was designed to support those use cases as a
simple and fast way to translate the SAX event stream to a character
stream that only contains text from the parsed document.
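
As a rough illustration of the idea (this is not the actual Tika source),
a handler of this kind only needs to override characters() and can leave
the element callbacks as the no-ops inherited from the standard SAX
DefaultHandler. The class and method names below are hypothetical, and
the SAX events are fired by hand rather than by a real parser. Note that
the element boundaries disappear in the output, so text from adjacent
elements runs together.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class TextOnlyHandlerSketch {

    /** Hypothetical handler: copies character data to a Writer, ignores elements. */
    static class TextOnlyHandler extends DefaultHandler {
        private final Writer writer;

        TextOnlyHandler(Writer writer) {
            this.writer = writer;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try {
                writer.write(ch, start, length);
            } catch (IOException e) {
                throw new SAXException("Error writing text", e);
            }
        }
        // startElement() and endElement() stay as DefaultHandler's no-ops.
    }

    /** Emits one element containing only text. */
    private static void element(ContentHandler h, String name, String text)
            throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    /** Fires a few hand-written SAX events, standing in for a parser. */
    static String extractText() throws SAXException {
        StringWriter out = new StringWriter();
        ContentHandler handler = new TextOnlyHandler(out);
        handler.startDocument();
        element(handler, "h1", "Simple Excel document");
        element(handler, "p", "Sample Excel Worksheet");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(extractText());
    }
}
```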

You can use a standard SAX TransformerHandler if you want to serialize
the full generated XHTML document.
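
A minimal sketch of that approach, using only the JAXP classes that ship
with the JDK. The class name is hypothetical, the events are again fired
by hand instead of by a parser, and for brevity the elements are emitted
without the XHTML namespace; in real use the TransformerHandler would be
passed to the parser as its ContentHandler.

```java
import java.io.StringWriter;

import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class XhtmlSerializerSketch {

    /** Emits one element containing only text. */
    private static void element(ContentHandler h, String name, String text)
            throws SAXException {
        h.startElement("", name, name, new AttributesImpl());
        h.characters(text.toCharArray(), 0, text.length());
        h.endElement("", name, name);
    }

    /** Serializes a hand-written SAX event stream, markup included. */
    static String serialize() throws Exception {
        // Assumes the default factory supports SAX, as the JDK's built-in one does.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "html", "html", none);
        handler.startElement("", "head", "head", none);
        element(handler, "title", "Simple Excel document");
        handler.endElement("", "head", "head");
        handler.startElement("", "body", "body", none);
        element(handler, "p", "Sample Excel Worksheet");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize());
    }
}
```

Unlike a text-only handler, this keeps the element structure, so the
title and the body text stay clearly separated in the output.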

BR,

Jukka Zitting