Posted to dev@tika.apache.org by Ken Krugler <kk...@transpac.com> on 2010/08/11 04:53:28 UTC

XHTMLContentHandler's lazyStartDocument can mess up order of elements

Hi all,

I was trying to debug why my fix for a problem with the Boilerpipe  
integration wasn't working, and came across  
XHTMLContentHandler.lazyStartDocument().

This, when used by HtmlHandler, essentially skips calling the user- 
provided content handler for the initial element tags (html, head,  
body) until it looks like there's a reason to generate content. Then  
it calls the content handler with no-attribute versions of these  
elements, so attributes in elements like <html lang="en"> will get  
stripped. Which seems like not a great thing, especially given ongoing  
work to make it easier to pass everything through if that's what's  
needed.

But the problem I ran into was with this sequence:

<html>
	<head>
		<title>xxx</title>
		<meta blah>
	</head>
	<body>
	...
	</body>
</html>

The problem is that this call to lazyStartDocument() is made when the  
<meta> element is encountered. So what the content handler gets called  
with is:

<html>
	<head>
		<title>xxx</title>
	</head>
	<body>

and then <meta>

So the <meta> element is getting passed through after the <body>  
element. And that in turn prevents Boilerpipe from behaving as expected.

But before I dive in here and start filing issues/hacking on the code,  
I'm wondering if somebody (OK, Jukka) can provide some color commentary.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Ken Krugler <kk...@transpac.com>.

Hi Jukka,

On Aug 12, 2010, at 12:43am, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 11, 2010 at 4:53 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> But before I dive in here and start filing issues/hacking on the  
>> code, I'm
>> wondering if somebody (OK, Jukka) can provide some color commentary.
>
> The rationale behind the lazy startup in XHTMLContentHandler is that
> many parsers don't yet have the document title metadata available when
> startDocument() is called. Instead of outputting an empty <title/>
> element, it's better to delay the startup to as late as possible.
>
> Now, more generally the contract of XHTMLContentHandler (see
> start/endDocument javadocs) is that the parser that feeds it should
> only output content that goes *inside* the <body/> element. Feeding a
> full <html/> tree to an XHTMLContentHandler will cause trouble.
>
> If you have a parser that wants to output a full <html/> tree along
> with extra <meta/> entries inside the <head/> element, you can always
> directly use the ContentHandler instance given as an argument to the
> parse() method.

Thanks for the input on this. I'll take a look at filing an issue &  
generating a patch today.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Ken Krugler <kk...@transpac.com>.

On Aug 13, 2010, at 2:06am, Andrzej Bialecki wrote:

> On 2010-08-13 10:34, Jukka Zitting wrote:
>> Hi,
>>
>> On Thu, Aug 12, 2010 at 8:27 PM, Ken Krugler
>> <kk...@transpac.com>  wrote:
>>> I think I'm missing something - which javadocs are you referring  
>>> to here?
>>> What I see for startDocument() is:
>>>
>>>    /**
>>>     * Starts an XHTML document by setting up the namespace mappings.
>>>     * The standard XHTML prefix is generated lazily when the first
>>>     * element is started.
>>>     */
>>
>> I guess the "standard XHTML prefix" is a bit vague here... Mea culpa.
>> The intention was that XHTMLContentHandler would provide everything  
>> up
>> to the opening <body> tag when startDocument() is called.
>>
>>> I saw your note on the issue in Jira:
>>> [...]
>>> This would work for <meta>, but not <link> or <base>.
>>
>> I'd argue that we shouldn't output the <base> element. Instead we
>> should normalize all URLs before giving them out to the client.
>
> Normalization rules may depend on situation... we could provide a  
> sensible default but I think it's safer to delegate this decision to  
> a component that you can override, because in general case  
> normalization rules may be quite complex.
>
> Example 1: you access a page from www.ibm.com/index.html, which  
> redirects to www-8.ibm.com/index.html for load-balancing. The  
> retrieved page may contain <base> that points back to www.ibm.com -  
> again, to ensure proper load-balancing. In this case, base href !=  
> page URL. Now, how do you normalize the links from the retrieved  
> page? (at some point in time this was a real case with this real  
> site ;) ).
>
> Example 2: <base> is http://a.com/index.html/index.html/index.html  
> (which is related to a known bug in some HTTP servers), and the  
> outlink is ../services.html. How do you normalize this?
>
> Of course, you can come up with some sensible defaults in each case,  
> but my point is that this issue is complicated, and there should be  
> a way to redefine this behavior.

I think Julien's idea about pushing more/most of this down into the  
HtmlMapper makes sense, as that feels like the only way to really give  
appropriate control over this behavior in a way that can be easily  
subclassed.

It's a bigger architectural change than what I have time for right  
now, so currently I'm extending the existing architecture to work  
around specific issues I'm hitting.

I did take Jukka's advice and emit all metadata elements in the  
resulting XHTML's <head> section. This provides better support for  
other parsers besides HTML, though it means that the resulting HTML  
can look a bit funky right now - for example, you will often get two  
<meta> tags, one for "Content-Type" and the other for "content-type",  
because HtmlHandler is remapping a <meta http-equiv> element. I've got  
that on my list to resolve.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Andrzej Bialecki <ab...@getopt.org>.

On 2010-08-13 10:34, Jukka Zitting wrote:
> Hi,
>
> On Thu, Aug 12, 2010 at 8:27 PM, Ken Krugler
> <kk...@transpac.com>  wrote:
>> I think I'm missing something - which javadocs are you referring to here?
>> What I see for startDocument() is:
>>
>>     /**
>>      * Starts an XHTML document by setting up the namespace mappings.
>>      * The standard XHTML prefix is generated lazily when the first
>>      * element is started.
>>      */
>
> I guess the "standard XHTML prefix" is a bit vague here... Mea culpa.
> The intention was that XHTMLContentHandler would provide everything up
> to the opening <body> tag when startDocument() is called.
>
>> I saw your note on the issue in Jira:
>> [...]
>> This would work for <meta>, but not <link> or <base>.
>
> I'd argue that we shouldn't output the <base> element. Instead we
> should normalize all URLs before giving them out to the client.

Normalization rules may depend on situation... we could provide a 
sensible default but I think it's safer to delegate this decision to a 
component that you can override, because in general case normalization 
rules may be quite complex.

Example 1: you access a page from www.ibm.com/index.html, which 
redirects to www-8.ibm.com/index.html for load-balancing. The retrieved 
page may contain <base> that points back to www.ibm.com - again, to 
ensure proper load-balancing. In this case, base href != page URL. Now, 
how do you normalize the links from the retrieved page? (at some point 
in time this was a real case with this real site ;) ).

Example 2: <base> is http://a.com/index.html/index.html/index.html 
(which is related to a known bug in some HTTP servers), and the outlink 
is ../services.html. How do you normalize this?

Of course, you can come up with some sensible defaults in each case, but 
my point is that this issue is complicated, and there should be a way to 
redefine this behavior.
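
A minimal sketch of what such an overridable component could look like, 
using only java.net.URI from the JDK. The LinkResolver interface and the 
two strategies are hypothetical names for illustration, not Tika API, and 
the "services.html" link in the first case is made up:

```java
import java.net.URI;

class LinkResolution {
    /** One possible default: honor <base href> when present, else the page URL. */
    static final LinkResolver HONOR_BASE =
        (pageUrl, baseHref, link) -> (baseHref != null ? baseHref : pageUrl).resolve(link);

    /** An alternative a client might plug in: ignore <base>, use the fetched URL. */
    static final LinkResolver IGNORE_BASE =
        (pageUrl, baseHref, link) -> pageUrl.resolve(link);

    public static void main(String[] args) {
        // Example 1: page fetched from www-8, <base> pointing back to www.
        URI page = URI.create("http://www-8.ibm.com/index.html");
        URI base = URI.create("http://www.ibm.com/index.html");
        System.out.println(HONOR_BASE.resolve(page, base, "services.html"));
        System.out.println(IGNORE_BASE.resolve(page, base, "services.html"));

        // Example 2: the repeated-path <base> with a relative outlink.
        URI odd = URI.create("http://a.com/index.html/index.html/index.html");
        System.out.println(HONOR_BASE.resolve(odd, odd, "../services.html"));
    }
}

/** Hypothetical strategy interface; not part of Tika. */
interface LinkResolver {
    URI resolve(URI pageUrl, URI baseHref, String link);
}
```

With plain RFC-style resolution, example 2 comes out as 
http://a.com/index.html/services.html, which is probably not what the 
site intended. That is the kind of case where a client would want to 
supply its own strategy.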

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Hey Guys,

>> In the short term, since this is a blocker for a project I'm working on, I
>> plan to slightly modify XHTMLContentHandler to allow it to work properly
>> with <head> elements (specifically, meta/link/base).
> 
> Go for it! You're the one with the itch and the cycles to implement a
> solution, so in the end it's your call on how to do this.

+1 from me too -- Ken, if you've got the time and cycles, go for it.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Aug 12, 2010 at 8:27 PM, Ken Krugler
<kk...@transpac.com> wrote:
> I think I'm missing something - which javadocs are you referring to here?
> What I see for startDocument() is:
>
>    /**
>     * Starts an XHTML document by setting up the namespace mappings.
>     * The standard XHTML prefix is generated lazily when the first
>     * element is started.
>     */

I guess the "standard XHTML prefix" is a bit vague here... Mea culpa.
The intention was that XHTMLContentHandler would provide everything up
to the opening <body> tag when startDocument() is called.

> I saw your note on the issue in Jira:
> [...]
> This would work for <meta>, but not <link> or <base>.

I'd argue that we shouldn't output the <base> element. Instead we
should normalize all URLs before giving them out to the client.

I agree with your point with <link> though. My solution doesn't
address that case.

> In the short term, since this is a blocker for a project I'm working on, I
> plan to slightly modify XHTMLContentHandler to allow it to work properly
> with <head> elements (specifically, meta/link/base).

Go for it! You're the one with the itch and the cycles to implement a
solution, so in the end it's your call on how to do this.

BR,

Jukka Zitting

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Ken Krugler <kk...@transpac.com>.

On Aug 12, 2010, at 12:43am, Jukka Zitting wrote:

> Hi,
>
> On Wed, Aug 11, 2010 at 4:53 AM, Ken Krugler
> <kk...@transpac.com> wrote:
>> But before I dive in here and start filing issues/hacking on the  
>> code, I'm
>> wondering if somebody (OK, Jukka) can provide some color commentary.
>
> The rationale behind the lazy startup in XHTMLContentHandler is that
> many parsers don't yet have the document title metadata available when
> startDocument() is called. Instead of outputting an empty <title/>
> element, it's better to delay the startup to as late as possible.
>
> Now, more generally the contract of XHTMLContentHandler (see
> start/endDocument javadocs) is that the parser that feeds it should
> only output content that goes *inside* the <body/> element. Feeding a
> full <html/> tree to an XHTMLContentHandler will cause trouble.

I think I'm missing something - which javadocs are you referring to  
here? What I see for startDocument() is:

     /**
      * Starts an XHTML document by setting up the namespace mappings.
      * The standard XHTML prefix is generated lazily when the first
      * element is started.
      */

and for endDocument():

     /**
      * Ends the XHTML document by writing the following footer and
      * clearing the namespace mappings:
      * <pre>
      *   &lt;/body&gt;
      * &lt;/html&gt;
      * </pre>
      */

> If you have a parser that wants to output a full <html/> tree along
> with extra <meta/> entries inside the <head/> element, you can always
> directly use the ContentHandler instance given as an argument to the
> parse() method.

I've opened TIKA-478, though working through the complex SAX event  
handling setup for HtmlParser has proven challenging.

Architecturally it feels like we need some major changes in the  
HtmlParser code to reconcile two somewhat conflicting goals: producing  
nice, normalized output, and passing more content through to the  
user-provided content handler. Julien had proposed ways to let the  
HtmlMapper do more of the heavy lifting, to allow for better external  
control of processing, but that hasn't yet turned into a patch.

I saw your note on the issue in Jira:

> Oh, I see now where this problem with <meta/> elements is coming from.
>
> One reasonably clean way to solve this would be to disable the  
> output of <meta/> elements from HtmlHandler while keeping the code  
> that sets the respective Metadata entries. Then in  
> XHTMLContentHandler we'd modify the lazyStartDocument() method to  
> output not just the <title/> element but the full set of collected  
> metadata as <meta/> elements. We could also set the lang attribute  
> (or xml:lang?) of the <html/> element if the respective Metadata  
> entry is set.
>
> The nice thing about this solution would be that the inclusion of  
> metadata in <head/> would work also for other document types beyond  
> HTML.

This would work for <meta>, but not <link> or <base>.

I could add these as additional metadata, but that feels wrong.

In the short term, since this is a blocker for a project I'm working  
on, I plan to slightly modify XHTMLContentHandler to allow it to work  
properly with <head> elements (specifically, meta/link/base).

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Aug 11, 2010 at 4:53 AM, Ken Krugler
<kk...@transpac.com> wrote:
> But before I dive in here and start filing issues/hacking on the code, I'm
> wondering if somebody (OK, Jukka) can provide some color commentary.

The rationale behind the lazy startup in XHTMLContentHandler is that
many parsers don't yet have the document title metadata available when
startDocument() is called. Instead of outputting an empty <title/>
element, it's better to delay the startup to as late as possible.

Now, more generally the contract of XHTMLContentHandler (see
start/endDocument javadocs) is that the parser that feeds it should
only output content that goes *inside* the <body/> element. Feeding a
full <html/> tree to an XHTMLContentHandler will cause trouble.

If you have a parser that wants to output a full <html/> tree along
with extra <meta/> entries inside the <head/> element, you can always
directly use the ContentHandler instance given as an argument to the
parse() method.
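
A minimal sketch of that last option, using only the SAX classes that
ship with the JDK. The class name, the meta attribute values and the
recording handler are made up for illustration; in a real parser the
handler would be the one passed to parse():

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class DirectSaxOutput {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Emits a complete html tree, with a meta element inside head. */
    static void emitFullTree(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl meta = new AttributesImpl();
        // Illustrative attribute values, not taken from any real document.
        meta.addAttribute("", "name", "name", "CDATA", "author");
        meta.addAttribute("", "content", "content", "CDATA", "xxx");

        handler.startDocument();
        handler.startPrefixMapping("", XHTML);
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "head", "head", none);
        handler.startElement(XHTML, "title", "title", none);
        char[] title = "xxx".toCharArray();
        handler.characters(title, 0, title.length);
        handler.endElement(XHTML, "title", "title");
        handler.startElement(XHTML, "meta", "meta", meta);
        handler.endElement(XHTML, "meta", "meta");
        handler.endElement(XHTML, "head", "head");
        handler.startElement(XHTML, "body", "body", none);
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endPrefixMapping("");
        handler.endDocument();
    }

    /** Records the order of start tags so the result can be inspected. */
    static List<String> recordStartTags() {
        List<String> seen = new ArrayList<>();
        try {
            emitFullTree(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local,
                                         String qName, Attributes atts) {
                    seen.add(local);
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(recordStartTags());
    }
}
```

Because the parser drives the handler itself, the <meta> element
arrives between <title> and <body>, in the order the parser chose.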

BR,

Jukka Zitting

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by "Mattmann, Chris A (388J)" <ch...@jpl.nasa.gov>.

Sorry Guys.

I'm +1 for Ken's proposal, and for potentially including examples of your Bixo tests in the Tika codebase :) Ken, can you attach some of your tests to a new JIRA issue for this and link it to TIKA-379?

Cheers,
Chris

On 8/11/10 7:19 PM, "Ken Krugler" <kk...@transpac.com> wrote:

Hi all,

Digging deeper, the current behavior seems to be causing problems that
were not evident in Tika 0.7. We noticed this when switching the Bixo
code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

<html>
        <head>
                <meta http-equiv="content-type" content="text/html; charset=utf-8">
                <title>Some Title</title>
        </head>
        <body>
        ...
</html>

The lazyStartDocument() method is called when the <meta> tag is
encountered by HtmlHandler, because it calls xhtml.startElement() with
the meta tag.

Since this is before <title> has been seen, the output generated has
an empty <title> element. And that causes a bunch of problems for our
tests.

I believe this (and the previous problem I'd reported) is a side-
effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A
rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements,
versus creating a fragile implicit dependency between its behavior and
that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up
until endElement() is called for <head>, or startElement() is called
for <body>, or endDocument() is called.

-- Ken

On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

> Hi all,
>
> I was trying to debug why my fix for a problem with the Boilerpipe
> integration wasn't working, and came across
> XHTMLContentHandler.lazyStartDocument().
>
> This, when used by HtmlHandler, essentially skips calling the user-
> provided content handler for the initial element tags (html, head,
> body) until it looks like there's a reason to generate content. Then
> it calls the content handler with no-attribute versions of these
> elements, so attributes in elements like <html lang="en"> will get
> stripped. Which seems like not a great thing, especially given
> ongoing work to make it easier to pass everything through if that's
> what's needed.
>
> But the problem I ran into was with this sequence:
>
> <html>
>       <head>
>               <title>xxx</title>
>               <meta blah>
>       </head>
>       <body>
>       ...
>       </body>
> </html>
>
> The problem is that this call to lazyStartDocument() is made when the
> <meta> element is encountered. So what the content handler gets
> called with is:
>
> <html>
>       <head>
>               <title>xxx</title>
>       </head>
>       <body>
>
> and then <meta>
>
> So the <meta> element is getting passed through after the <body>
> element. And that in turn prevents Boilerpipe from behaving as
> expected.
>
> But before I dive in here and start filing issues/hacking on the
> code, I'm wondering if somebody (OK, Jukka) can provide some color
> commentary.
>
> Thanks,
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: Chris.Mattmann@jpl.nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Re: XHTMLContentHandler's lazyStartDocument can mess up order of elements

Posted by Ken Krugler <kk...@transpac.com>.

Hi all,

Digging deeper, the current behavior seems to be causing problems that  
were not evident in Tika 0.7. We noticed this when switching the Bixo  
code to use Tika 0.8-SNAPSHOT.

For example, if you have a document that looks like:

<html>
	<head>
		<meta http-equiv="content-type" content="text/html; charset=utf-8">
		<title>Some Title</title>
	</head>
	<body>
	...
</html>

The lazyStartDocument() method is called when the <meta> tag is  
encountered by HtmlHandler, because it calls xhtml.startElement() with  
the meta tag.

Since this is before <title> has been seen, the output generated has  
an empty <title> element. And that causes a bunch of problems for our  
tests.

I believe this (and the previous problem I'd reported) is a side- 
effect of TIKA-379, which Chris M. rolled in during change 949635.

Unfortunately I think lazyStartDocument() needs to be re-thought. A  
rough proposal would be:

1. HtmlHandler should call xhtml start/endElement for all elements,  
versus creating a fragile implicit dependency between its behavior and  
that of XHTMLContentHandler.

2. In XHTMLContentHandler, the elements received should be queued up  
until endElement() is called for <head>, or startElement() is called  
for <body>, or endDocument() is called.
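
A rough sketch of the queueing idea in point 2, written against the  
JDK's SAX classes only. HeadBufferingHandler is a hypothetical name and  
this is not the XHTMLContentHandler code; it just holds events back  
until </head>, <body> or endDocument() and then replays them in order.  
The flush point is where a real implementation could fill in the title  
or other metadata before anything reaches the user's handler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class HeadBufferingHandler extends DefaultHandler {
    /** A deferred SAX call that can be replayed later. */
    interface Event { void replay(ContentHandler target) throws SAXException; }

    private final ContentHandler target;
    private final List<Event> queue = new ArrayList<>();
    private boolean buffering = true;

    HeadBufferingHandler(ContentHandler target) { this.target = target; }

    private void dispatch(Event e) throws SAXException {
        if (buffering) queue.add(e); else e.replay(target);
    }

    private void flush() throws SAXException {
        buffering = false;
        for (Event e : queue) e.replay(target);
        queue.clear();
    }

    @Override public void startDocument() throws SAXException { target.startDocument(); }

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) throws SAXException {
        if (buffering && "body".equals(local)) flush();
        Attributes copy = new AttributesImpl(atts); // SAX may reuse atts
        dispatch(t -> t.startElement(uri, local, qName, copy));
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        dispatch(t -> t.endElement(uri, local, qName));
        if (buffering && "head".equals(local)) flush();
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        char[] copy = Arrays.copyOfRange(ch, start, start + length);
        dispatch(t -> t.characters(copy, 0, copy.length));
    }

    @Override
    public void endDocument() throws SAXException {
        if (buffering) flush();
        target.endDocument();
    }

    /** Feeds a <meta>-before-<title> head through the buffer. */
    static String demo() {
        List<String> seen = new ArrayList<>();
        ContentHandler sink = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                seen.add("<" + l + ">");
            }
        };
        HeadBufferingHandler h = new HeadBufferingHandler(sink);
        AttributesImpl none = new AttributesImpl();
        try {
            h.startElement("", "html", "html", none);
            h.startElement("", "head", "head", none);
            h.startElement("", "meta", "meta", none);
            h.endElement("", "meta", "meta");
            h.startElement("", "title", "title", none);
            h.characters("Some Title".toCharArray(), 0, 10);
            h.endElement("", "title", "title");
            int before = seen.size(); // start tags forwarded before </head>
            h.endElement("", "head", "head");
            h.startElement("", "body", "body", none);
            return before + ":" + seen;
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) { System.out.println(demo()); }
}
```

In the demo nothing reaches the downstream handler until </head> is  
seen, so by then the title text is already known even though <meta>  
came first.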

-- Ken

On Aug 10, 2010, at 7:53pm, Ken Krugler wrote:

> Hi all,
>
> I was trying to debug why my fix for a problem with the Boilerpipe  
> integration wasn't working, and came across  
> XHTMLContentHandler.lazyStartDocument().
>
> This, when used by HtmlHandler, essentially skips calling the user- 
> provided content handler for the initial element tags (html, head,  
> body) until it looks like there's a reason to generate content. Then  
> it calls the content handler with no-attribute versions of these  
> elements, so attributes in elements like <html lang="en"> will get  
> stripped. Which seems like not a great thing, especially given  
> ongoing work to make it easier to pass everything through if that's  
> what's needed.
>
> But the problem I ran into was with this sequence:
>
> <html>
> 	<head>
> 		<title>xxx</title>
> 		<meta blah>
> 	</head>
> 	<body>
> 	...
> 	</body>
> </html>
>
> The problem is that this call to lazyStartDocument() is made when the  
> <meta> element is encountered. So what the content handler gets  
> called with is:
>
> <html>
> 	<head>
> 		<title>xxx</title>
> 	</head>
> 	<body>
>
> and then <meta>
>
> So the <meta> element is getting passed through after the <body>  
> element. And that in turn prevents Boilerpipe from behaving as  
> expected.
>
> But before I dive in here and start filing issues/hacking on the  
> code, I'm wondering if somebody (OK, Jukka) can provide some color  
> commentary.
>
> Thanks,
>
> -- Ken
>
> --------------------------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
>
>
>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g