Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/12 15:12:24 UTC

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Hi Jukka,

On Aug 12, 2010, at 12:43am, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 11, 2010 at 4:53 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> But before I dive in here and start filing issues/hacking on the code,
>> I'm wondering if somebody (OK, Jukka) can provide some color commentary.
>
> The rationale behind the lazy startup in XHTMLContentHandler is that
> many parsers don't yet have the document title metadata available when
> startDocument() is called. Instead of outputting an empty <title/>
> element, it's better to delay the startup to as late as possible.
>
> Now, more generally the contract of XHTMLContentHandler (see
> start/endDocument javadocs) is that the parser that feeds it should
> only output content that goes *inside* the <body/> element. Feeding a
> full <html/> tree to an XHTMLContentHandler will cause trouble.
>
> If you have a parser that wants to output a full <html/> tree along
> with extra <meta/> entries inside the <head/> element, you can always
> directly use the ContentHandler instance given as an argument to the
> parse() method.
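
For anyone following the thread, here is a rough sketch of that last
option: writing the full <html/> tree, including extra <meta/> entries
in <head/>, straight to a SAX ContentHandler instead of going through
XHTMLContentHandler. It uses only JDK classes. The FullHtmlEmitter
class and its emit() and render() methods are made up for illustration
and are not Tika code, and the JDK serializer below merely stands in for
whatever handler a caller would pass to parse().

```java
import java.io.StringWriter;
import java.util.Collections;
import java.util.Map;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper, not part of Tika: emits a complete XHTML document
// as SAX events, so the caller controls the order of <head/> children.
class FullHtmlEmitter {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    static void emit(ContentHandler handler, String title,
                     Map<String, String> meta, String bodyText) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "head", "head", none);
        // Extra <meta/> entries are written first, then <title/>.
        for (Map.Entry<String, String> entry : meta.entrySet()) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "name", "name", "CDATA", entry.getKey());
            attrs.addAttribute("", "content", "content", "CDATA", entry.getValue());
            handler.startElement(XHTML, "meta", "meta", attrs);
            handler.endElement(XHTML, "meta", "meta");
        }
        handler.startElement(XHTML, "title", "title", none);
        handler.characters(title.toCharArray(), 0, title.length());
        handler.endElement(XHTML, "title", "title");
        handler.endElement(XHTML, "head", "head");
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "p", "p", none);
        handler.characters(bodyText.toCharArray(), 0, bodyText.length());
        handler.endElement(XHTML, "p", "p");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    // Serializes the events to a string with the JDK's identity transformer.
    static String render(String title, Map<String, String> meta, String bodyText) {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) SAXTransformerFactory.newInstance();
            TransformerHandler serializer = factory.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.getTransformer()
                    .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));
            emit(serializer, title, meta, bodyText);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize SAX events", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                render("Example", Collections.singletonMap("author", "Ken"), "Hello"));
    }
}
```

Since every element is written explicitly, nothing is deferred and the
<meta/> entries cannot end up after content that was meant to follow
them. The flip side is that the caller has to produce <head/>, <title/>
and <body/> itself.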

Thanks for the input on this. I'll take a look at filing an issue &
generating a patch today.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g