Posted to user@tika.apache.org by Devaraja Swami <de...@gmail.com> on 2014/09/09 04:12:48 UTC

HTML parsing error with <a> tag inside <h1> tag

In the following HTML document, the <a> is inside the <h1> tag which is
inside the <p> tag:
-------------------
<!DOCTYPE html>
<html>
<body>
<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
</body>
</html>
-------------------
But when I parse it with Tika 1.5 HtmlParser,
it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.

The same error happens when I replace the <h1> tag with other header tags
<h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
[Haven't experimented with other replacements].

This seems to be a basic issue.
Any help would be deeply appreciated.

Cheers,
Devarajan

Re: HTML parsing error with <a> tag inside <h1> tag

Posted by Chris Mattmann <ma...@apache.org>.

Devarajan,


Ken's answer provides some more detail, so please check that out.

Furthermore, to repeat: I am not sure you understand
what I'm saying. You are comparing Tika to a SAX compliant parser.
Tika is much more than this. This isn't me being "defensive" as you
put it below, it's me trying to share the philosophy behind Tika
with you.

At the end of the day it seems you are interested in a SAX compliant
parser. You have a few options there:

1. Use TagSoup and/or NekoHTML and/or <<insert HTML parsing library
here>> directly if you need SAX compliant HTML parsing, given your
concerns about preserving the upstream DOM, etc.

2. Roll your own Parser and add it to Tika through the Java SPI,
like the rest of the Tika Parsers are defined. Declare that it
supports the (X)HTML MIME type.

Cheers,
Chris




-----Original Message-----
From: Devaraja Swami <de...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Monday, September 8, 2014 10:26 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: HTML parsing error with <a> tag inside <h1> tag

>That indeed is my, IMHO quite reasonable, expectation!
>
>
>In many content analysis applications, like mine, the resulting DOM
>structure is the objective of parsing, not merely a dump of the textual
>content. 
>In these applications, the DOM structure is generated by a user-provided
>content handler which accepts the stream of SAX content handler calls
>from the Tika parser. As an example [though not my application of
>interest], this is exactly how Nutch uses Tika,
> where the Nutch-provided content handler is called DOMBuilder.
>
>
>[Since you were a contributor to Nutch before spinning off Tika, I am
>sure you can understand its importance :-) ]
>
>
>To shed further light on the problem, and to lighten your defensive
>concern, I don't believe Tika source code is jumbling the order of the
>tags. 
>I think it is your upstream parser - TagSoup to be precise:
>
>
>I just ran the same file directly through the latest TagSoup and the
>latest NekoHTML.
>The former causes the order jumbling above, where the latter faithfully
>forwards the incoming tag order.
>In fact, TagSoup has other problems, like lack of handling of HTML5 tags,
>for which I had to develop workarounds using custom HTML schema class
>(similar to the workaround you posted some time ago).
>
>
>So my second question is, is it possible for you to alter Tika so that
>the user can specify at runtime the present raw HTML parser (TagSoup or
>NekoHTML) to the Tika HtmlParser, and bundle both options in the Tika
>dependencies? Failing this, I have to create
> an internal hack of the Tika HtmlParser to use NekoHTML instead of
>TagSoup. 
>
>
>My concern that Tika should indeed guarantee the faithful forwarding of
>the incoming order of tags and text [just like the contract for any SAX
>compliant parser] still holds though...
>
>
>Cheers,
>Devarajan
>
>
>
>
>
>
>On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980)
><ch...@jpl.nasa.gov> wrote:
>
>Thanks Devarajan.
>
>I think your expectation below is not the way that Tika handles
>parsing. I don't believe Tika guarantees taking in an XHTML file and
>parsing it into Tika's
>intermediate XHTML structure the same way that the XHTML file came
>in (i.e., with the tags in the same order).
>
>Is that your expectation?
>
>Cheers,
>Chris
>
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Chief Architect
>Instrument Software and Science Data Systems Section (398)
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 168-519, Mailstop: 168-527
>Email: chris.a.mattmann@nasa.gov
>WWW:  http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Associate Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
>-----Original Message-----
>From: Devaraja Swami <de...@gmail.com>
>Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
>Date: Monday, September 8, 2014 9:29 PM
>To: "user@tika.apache.org" <us...@tika.apache.org>
>Subject: Re: HTML parsing error with <a> tag inside <h1> tag
>
>>Hi Chris,
>>
>>
>>Thanks for your reply.
>>
>>
>>To ad more clarity to my original post, I expect that the Tika 1.5
>>HtmlParser should parse any HTML input source and pass along the tags in
>>the order appearing in the HTML source correctly to the downstream (user
>>supplied) SAX content handler.
>>
>>This is not happening currently.
>>
>>
>>
>>For my HTML source, the Tika upstream parser (HtmlParser) that I call
>>using the Tika API is sending the end tag [ endElement() ] of the
>>enclosing <h1> tag to the (my) downstream content handler before it sends
>>along the start tag [ startElement() ] of
>> the enclosed <a> tag.
>>
>>
>>IMHO, this is a clear, and quite serious, upstream parsing error.
>>
>>
>>
>>If possible, could you please shed some light on this, or explain how I
>>can overcome this?
>>If necessary, I can add a JIRA on this.
>>
>>
>>
>>Thanks,
>>Devarajan
>>
>>
>>
>>
>>On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
>><ch...@gmail.com> wrote:
>>
>>Hi Devarajan,
>>
>>Please see Chapter 5 of the Tika in Action book for more
>>detail on this. The short answer is that the parsed XHTML
>>representation of *any* upstream file does not necessarily
>>correspond to the upstream (X)HTML representation of the
>>file. The XHTML is an intermediate format that Tika uses
>>to represent the parsed structure content around the text.
>>That is, if you have the following scenario:
>>
>>PDF->XHTML->content handlers
>>XHTML->XHTML->content handlers
>>Word Doc->XHTML->content handlers
>>Image->XHTML-content handlers
>>..
>>etc
>>
>>Note that XHTML intermediate is the structured representation
>>of the information around the text in the document (including
>>its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
>>for stream-based processing downstream.
>>
>>Cheers,
>>Chris
>>
>>------------------------
>>Chris Mattmann
>>chris.mattmann@gmail.com
>>
>>
>>
>>
>>-----Original Message-----
>>From: Devaraja Swami <de...@gmail.com>
>>Reply-To: <us...@tika.apache.org>
>>Date: Monday, September 8, 2014 7:12 PM
>>To: <us...@tika.apache.org>
>>Subject: HTML parsing error with <a> tag inside <h1> tag
>>
>>>In the following HTML document, the <a> is inside the <h1> tag which is
>>>inside the <p> tag:
>>>-------------------
>>><!DOCTYPE html>
>>><html>
>>><body>
>>>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>>></body>
>>></html>
>>>-------------------
>>>But when I parse it with Tika 1.5 HtmlParser,
>>>it adds both the <a> and <h1> tag nodes as direct children of the <p>
>>>tag.
>>>
>>>The same error happens when I replace the <h1> tag with other header
>>>tags
>>><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>>>[Haven't experimented with other replacements].
>>>
>>>This seems to be a basic issue.
>>>Any help would be deeply appreciated.
>>>
>>>Cheers,
>>>Devarajan
>>>

RE: HTML parsing error with <a> tag inside <h1> tag

Posted by Ken Krugler <kk...@transpac.com>.

Hi Devarajan,

You are correct that the issue is with TagSoup, where it assumes you can't have an anchor (<a>) element inside of a header (<h1>, <h2>, etc) element.

And yes, TagSoup is missing support for HTML5 tags; I believe Markus Jelsma was trying to get that fixed, but I don't think he's had much luck with the TagSoup author.

Originally Tika did use NekoHTML, but that was replaced by TagSoup back in 2009. See https://issues.apache.org/jira/browse/TIKA-310 for details.

I haven't looked at how hard it would be to make the HTML parser pluggable. Patches welcome :)

Though as a first cut, you could create your own parser that clones the current HTML support and does a hard-coded replacement, for testing purposes.

One final point - since Tika tries to guarantee XHTML 1.0-compliant output, you cannot assume that whatever you put into Tika will give you a corresponding DOM.

-- Ken

> From: Devaraja Swami
> Sent: September 8, 2014 10:26:24pm PDT
> To: user@tika.apache.org
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
> 
> That indeed is my, IMHO quite reasonable, expectation!
> 
> In many content analysis applications, like mine, the resulting DOM structure is the objective of parsing, not merely a dump of the textual content. 
> In these applications, the DOM structure is generated by a user-provided content handler which accepts the stream of SAX content handler calls from the Tika parser. As an example [though not my application of interest], this is exactly how Nutch uses Tika, where the Nutch-provided content handler is called DOMBuilder. 
> 
> [Since you were a contributor to Nutch before spinning off Tika, I am sure you can understand its importance :-) ]
> 
> To shed further light on the problem, and to lighten your defensive concern, I don't believe Tika source code is jumbling the order of the tags. 
> I think it is your upstream parser - TagSoup to be precise:
> 
> I just ran the same file directly through the latest TagSoup and the latest NekoHTML.
> The former causes the order jumbling above, where the latter faithfully forwards the incoming tag order.
> In fact, TagSoup has other problems, like lack of handling of HTML5 tags, for which I had to develop workarounds using custom HTML schema class (similar to the workaround you posted some time ago).
> 
> So my second question is, is it possible for you to alter Tika so that the user can specify at runtime the present raw HTML parser (TagSoup or NekoHTML) to the Tika HtmlParser, and bundle both options in the Tika dependencies? Failing this, I have to create an internal hack of the Tika HtmlParser to use NekoHTML instead of TagSoup. 
> 
> My concern that Tika should indeed guarantee the faithful forwarding of the incoming order of tags and text [just like the contract for any SAX compliant parser] still holds though...
> 
> Cheers,
> Devarajan
> 
> 
> 
> On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980) <ch...@jpl.nasa.gov> wrote:
> Thanks Devarajan.
> 
> I think your expectation below is not the way that Tika handles
> parsing. I don't believe Tika guarantees taking in an XHTML file and
> parsing it into Tika's
> intermediate XHTML structure the same way that the XHTML file came
> in (i.e., with the tags in the same order).
> 
> Is that your expectation?
> 
> Cheers,
> Chris
> 
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Devaraja Swami <de...@gmail.com>
> Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
> Date: Monday, September 8, 2014 9:29 PM
> To: "user@tika.apache.org" <us...@tika.apache.org>
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
> 
> >Hi Chris,
> >
> >
> >Thanks for your reply.
> >
> >
> >To ad more clarity to my original post, I expect that the Tika 1.5
> >HtmlParser should parse any HTML input source and pass along the tags in
> >the order appearing in the HTML source correctly to the downstream (user
> >supplied) SAX content handler.
> >
> >This is not happening currently.
> >
> >
> >
> >For my HTML source, the Tika upstream parser (HtmlParser) that I call
> >using the Tika API is sending the end tag [ endElement() ] of the
> >enclosing <h1> tag to the (my) downstream content handler before it sends
> >along the start tag [ startElement() ] of
> > the enclosed <a> tag.
> >
> >
> >IMHO, this is a clear, and quite serious, upstream parsing error.
> >
> >
> >
> >If possible, could you please shed some light on this, or explain how I
> >can overcome this?
> >If necessary, I can add a JIRA on this.
> >
> >
> >
> >Thanks,
> >Devarajan
> >
> >
> >
> >
> >On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
> ><ch...@gmail.com> wrote:
> >
> >Hi Devarajan,
> >
> >Please see Chapter 5 of the Tika in Action book for more
> >detail on this. The short answer is that the parsed XHTML
> >representation of *any* upstream file does not necessarily
> >correspond to the upstream (X)HTML representation of the
> >file. The XHTML is an intermediate format that Tika uses
> >to represent the parsed structure content around the text.
> >That is, if you have the following scenario:
> >
> >PDF->XHTML->content handlers
> >XHTML->XHTML->content handlers
> >Word Doc->XHTML->content handlers
> >Image->XHTML-content handlers
> >..
> >etc
> >
> >Note that XHTML intermediate is the structured representation
> >of the information around the text in the document (including
> >its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
> >for stream-based processing downstream.
> >
> >Cheers,
> >Chris
> >
> >------------------------
> >Chris Mattmann
> >chris.mattmann@gmail.com
> >
> >
> >
> >
> >-----Original Message-----
> >From: Devaraja Swami <de...@gmail.com>
> >Reply-To: <us...@tika.apache.org>
> >Date: Monday, September 8, 2014 7:12 PM
> >To: <us...@tika.apache.org>
> >Subject: HTML parsing error with <a> tag inside <h1> tag
> >
> >>In the following HTML document, the <a> is inside the <h1> tag which is
> >>inside the <p> tag:
> >>-------------------
> >><!DOCTYPE html>
> >><html>
> >><body>
> >>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> >></body>
> >></html>
> >>-------------------
> >>But when I parse it with Tika 1.5 HtmlParser,
> >>it adds both the <a> and <h1> tag nodes as direct children of the <p>
> >>tag.
> >>
> >>The same error happens when I replace the <h1> tag with other header tags
> >><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >>[Haven't experimented with other replacements].
> >>
> >>This seems to be a basic issue.
> >>Any help would be deeply appreciated.
> >>
> >>Cheers,
> >>Devarajan
> >>

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr








Re: HTML parsing error with <a> tag inside <h1> tag

Posted by Devaraja Swami <de...@gmail.com>.

That indeed is my, IMHO quite reasonable, expectation!

In many content analysis applications, like mine, the resulting DOM
structure is the objective of parsing, not merely a dump of the textual
content.
In these applications, the DOM structure is generated by a user-provided
content handler which accepts the stream of SAX content handler calls from
the Tika parser. As an example [though not my application of interest],
this is exactly how Nutch uses Tika, where the Nutch-provided content
handler is called DOMBuilder.
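
A minimal sketch of that arrangement, using only JDK classes: a TransformerHandler from JAXP acts as the downstream content handler and turns whatever SAX events it receives into a DOM. The class name DomFromSax and the helper parentOfAnchor are made up for illustration, and the JDK XML parser stands in for the upstream parser here; this is not Nutch's DOMBuilder or Tika's code. With Tika, the same handler would presumably be passed as the ContentHandler argument of the parser's parse call instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper (not part of Tika or Nutch): builds a DOM from SAX events.
final class DomFromSax {

    // Feeds the SAX events for well-formed XML into a DOM-building handler.
    static Document build(String xml) {
        try {
            // An identity TransformerHandler is a ContentHandler that writes
            // the events it receives into the given Result, here a DOM.
            SAXTransformerFactory tf =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler domBuilder = tf.newTransformerHandler();
            DOMResult result = new DOMResult();
            domBuilder.setResult(result);

            // Stand-in for the upstream parser: the JDK's own SAX parser.
            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setNamespaceAware(true);
            XMLReader reader = pf.newSAXParser().getXMLReader();
            reader.setContentHandler(domBuilder);
            reader.parse(new InputSource(new StringReader(xml)));

            Node root = result.getNode();
            return (root instanceof Document) ? (Document) root : root.getOwnerDocument();
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM", e);
        }
    }

    // Returns the name of the element that directly contains the first <a>.
    static String parentOfAnchor(String xml) {
        Node anchor = build(xml).getElementsByTagName("a").item(0);
        return anchor.getParentNode().getNodeName();
    }

    public static void main(String[] args) {
        // Event order as written in the source document.
        System.out.println(parentOfAnchor(
                "<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>"));
        // Event order as reported in this thread: <h1> ends before <a> starts.
        System.out.println(parentOfAnchor(
                "<div><h1></h1><a href=\"http://www.google.com\">GOOGLE!</a></div>"));
    }
}
```

The point of the comparison is that the DOM simply mirrors the order of the calls: when the end of <h1> is delivered before the start of <a>, the anchor ends up as a sibling of the heading rather than its child.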

[Since you were a contributor to Nutch before spinning off Tika, I am sure
you can understand its importance :-) ]

To shed further light on the problem, and to lighten your defensive
concern, I don't believe Tika source code is jumbling the order of the
tags.
I think it is your upstream parser - TagSoup to be precise:

I just ran the same file directly through the latest TagSoup and the latest
NekoHTML.
The former causes the order jumbling above, where the latter faithfully
forwards the incoming tag order.
In fact, TagSoup has other problems, like lack of handling of HTML5 tags,
for which I had to develop workarounds using custom HTML schema class
(similar to the workaround you posted some time ago).

So my second question is, is it possible for you to alter Tika so that the
user can specify at runtime the present raw HTML parser (TagSoup or
NekoHTML) to the Tika HtmlParser, and bundle both options in the Tika
dependencies? Failing this, I have to create an internal hack of the Tika
HtmlParser to use NekoHTML instead of TagSoup.

My concern that Tika should indeed guarantee the faithful forwarding of the
incoming order of tags and text [just like the contract for any SAX
compliant parser] still holds though...

Cheers,
Devarajan



On Mon, Sep 8, 2014 at 10:01 PM, Mattmann, Chris A (3980) <
chris.a.mattmann@jpl.nasa.gov> wrote:

> Thanks Devarajan.
>
> I think your expectation below is not the way that Tika handles
> parsing. I don't believe Tika guarantees taking in an XHTML file and
> parsing it into Tika's
> intermediate XHTML structure the same way that the XHTML file came
> in (i.e., with the tags in the same order).
>
> Is that your expectation?
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattmann@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Devaraja Swami <de...@gmail.com>
> Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
> Date: Monday, September 8, 2014 9:29 PM
> To: "user@tika.apache.org" <us...@tika.apache.org>
> Subject: Re: HTML parsing error with <a> tag inside <h1> tag
>
> >Hi Chris,
> >
> >
> >Thanks for your reply.
> >
> >
> >To ad more clarity to my original post, I expect that the Tika 1.5
> >HtmlParser should parse any HTML input source and pass along the tags in
> >the order appearing in the HTML source correctly to the downstream (user
> >supplied) SAX content handler.
> >
> >This is not happening currently.
> >
> >
> >
> >For my HTML source, the Tika upstream parser (HtmlParser) that I call
> >using the Tika API is sending the end tag [ endElement() ] of the
> >enclosing <h1> tag to the (my) downstream content handler before it sends
> >along the start tag [ startElement() ] of
> > the enclosed <a> tag.
> >
> >
> >IMHO, this is a clear, and quite serious, upstream parsing error.
> >
> >
> >
> >If possible, could you please shed some light on this, or explain how I
> >can overcome this?
> >If necessary, I can add a JIRA on this.
> >
> >
> >
> >Thanks,
> >Devarajan
> >
> >
> >
> >
> >On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
> ><ch...@gmail.com> wrote:
> >
> >Hi Devarajan,
> >
> >Please see Chapter 5 of the Tika in Action book for more
> >detail on this. The short answer is that the parsed XHTML
> >representation of *any* upstream file does not necessarily
> >correspond to the upstream (X)HTML representation of the
> >file. The XHTML is an intermediate format that Tika uses
> >to represent the parsed structure content around the text.
> >That is, if you have the following scenario:
> >
> >PDF->XHTML->content handlers
> >XHTML->XHTML->content handlers
> >Word Doc->XHTML->content handlers
> >Image->XHTML-content handlers
> >..
> >etc
> >
> >Note that XHTML intermediate is the structured representation
> >of the information around the text in the document (including
> >its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
> >for stream-based processing downstream.
> >
> >Cheers,
> >Chris
> >
> >------------------------
> >Chris Mattmann
> >chris.mattmann@gmail.com
> >
> >
> >
> >
> >-----Original Message-----
> >From: Devaraja Swami <de...@gmail.com>
> >Reply-To: <us...@tika.apache.org>
> >Date: Monday, September 8, 2014 7:12 PM
> >To: <us...@tika.apache.org>
> >Subject: HTML parsing error with <a> tag inside <h1> tag
> >
> >>In the following HTML document, the <a> is inside the <h1> tag which is
> >>inside the <p> tag:
> >>-------------------
> >><!DOCTYPE html>
> >><html>
> >><body>
> >>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> >></body>
> >></html>
> >>-------------------
> >>But when I parse it with Tika 1.5 HtmlParser,
> >>it adds both the <a> and <h1> tag nodes as direct children of the <p>
> >>tag.
> >>
> >>The same error happens when I replace the <h1> tag with other header tags
> >><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >>[Haven't experimented with other replacements].
> >>
> >>This seems to be a basic issue.
> >>Any help would be deeply appreciated.
> >>
> >>Cheers,
> >>Devarajan
> >>

Re: HTML parsing error with <a> tag inside <h1> tag

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.

Thanks Devarajan.

I think your expectation below is not the way that Tika handles
parsing. I don't believe Tika guarantees taking in an XHTML file and
parsing it into Tika's
intermediate XHTML structure the same way that the XHTML file came
in (i.e., with the tags in the same order).

Is that your expectation?

Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Devaraja Swami <de...@gmail.com>
Reply-To: "user@tika.apache.org" <us...@tika.apache.org>
Date: Monday, September 8, 2014 9:29 PM
To: "user@tika.apache.org" <us...@tika.apache.org>
Subject: Re: HTML parsing error with <a> tag inside <h1> tag

>Hi Chris,
>
>
>Thanks for your reply.
>
>
>To ad more clarity to my original post, I expect that the Tika 1.5
>HtmlParser should parse any HTML input source and pass along the tags in
>the order appearing in the HTML source correctly to the downstream (user
>supplied) SAX content handler.
>
>This is not happening currently.
>
>
>
>For my HTML source, the Tika upstream parser (HtmlParser) that I call
>using the Tika API is sending the end tag [ endElement() ] of the
>enclosing <h1> tag to the (my) downstream content handler before it sends
>along the start tag [ startElement() ] of
> the enclosed <a> tag.
>
>
>IMHO, this is a clear, and quite serious, upstream parsing error.
>
>
>
>If possible, could you please shed some light on this, or explain how I
>can overcome this?
>If necessary, I can add a JIRA on this.
>
>
>
>Thanks,
>Devarajan
>
>
>
>
>On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann
><ch...@gmail.com> wrote:
>
>Hi Devarajan,
>
>Please see Chapter 5 of the Tika in Action book for more
>detail on this. The short answer is that the parsed XHTML
>representation of *any* upstream file does not necessarily
>correspond to the upstream (X)HTML representation of the
>file. The XHTML is an intermediate format that Tika uses
>to represent the parsed structure content around the text.
>That is, if you have the following scenario:
>
>PDF->XHTML->content handlers
>XHTML->XHTML->content handlers
>Word Doc->XHTML->content handlers
>Image->XHTML-content handlers
>..
>etc
>
>Note that XHTML intermediate is the structured representation
>of the information around the text in the document (including
>its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
>for stream-based processing downstream.
>
>Cheers,
>Chris
>
>------------------------
>Chris Mattmann
>chris.mattmann@gmail.com
>
>
>
>
>-----Original Message-----
>From: Devaraja Swami <de...@gmail.com>
>Reply-To: <us...@tika.apache.org>
>Date: Monday, September 8, 2014 7:12 PM
>To: <us...@tika.apache.org>
>Subject: HTML parsing error with <a> tag inside <h1> tag
>
>>In the following HTML document, the <a> is inside the <h1> tag which is
>>inside the <p> tag:
>>-------------------
>><!DOCTYPE html>
>><html>
>><body>
>>       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>></body>
>></html>
>>-------------------
>>But when I parse it with Tika 1.5 HtmlParser,
>>it adds both the <a> and <h1> tag nodes as direct children of the <p>
>>tag.
>>
>>The same error happens when I replace the <h1> tag with other header tags
>><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>>[Haven't experimented with other replacements].
>>
>>This seems to be a basic issue.
>>Any help would be deeply appreciated.
>>
>>Cheers,
>>Devarajan
>>

Re: HTML parsing error with <a> tag inside <h1> tag

Posted by Devaraja Swami <de...@gmail.com>.

More trace data: This is the sequence of startElement and endElement calls
from the Tika 1.5 HtmlParser to my downstream content handler:
---------------------------------------------------------------------------------------------
STARTED TIKA PARSING

START ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>
START ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>:  <[content] ---> [ISO-8859-1], [name] ---> [Content-Encoding]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
ELEMENT ATTRIBUTES <meta>:  <[content] ---> [text/html; charset=ISO-8859-1], [name] ---> [Content-Type]>
END ELEMENT <meta> <http://www.w3.org/1999/xhtml> <meta>
START ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <title> <http://www.w3.org/1999/xhtml> <title>
END ELEMENT <head> <http://www.w3.org/1999/xhtml> <head>
START ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
START ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
START ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
START ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
END ELEMENT <h1> <http://www.w3.org/1999/xhtml> <h1>
START ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
ELEMENT ATTRIBUTES <a>:  <[href] ---> [http://www.google.com], [shape] ---> [rect]>
TEXT <GOOGLE!>
END ELEMENT <a> <http://www.w3.org/1999/xhtml> <a>
END ELEMENT <cite> <http://www.w3.org/1999/xhtml> <cite>
END ELEMENT <address> <http://www.w3.org/1999/xhtml> <address>
END ELEMENT <body> <http://www.w3.org/1999/xhtml> <body>
END ELEMENT <html> <http://www.w3.org/1999/xhtml> <html>

COMPLETED TIKA PARSING
---------------------------------------------------------------------------------------------

[I skipped traces for calls to characters(...) which are passing along pure
whitespace.]
[Also looks like Tika (or TagSoup) is adding a <head> and two <meta> tags.]
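
For anyone wanting to reproduce a trace of this kind, a recording content handler along the following lines should be enough. This is only a sketch under assumptions: the class name SaxTrace and its trace method are invented for illustration, the line format is a simplified version of the one above, and the JDK SAX parser is used as the event source so that the example runs without Tika. Pure-whitespace text is skipped, as in the trace above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not part of Tika): records SAX callbacks as trace lines.
final class SaxTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // characters() may arrive in several chunks, so text is buffered and
    // written out at the next tag boundary; whitespace-only text is dropped.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            events.add("TEXT <" + s + ">");
        }
        text.setLength(0);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        String name = localName.isEmpty() ? qName : localName;
        events.add("START ELEMENT <" + name + ">");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        String name = localName.isEmpty() ? qName : localName;
        events.add("END ELEMENT <" + name + ">");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    // Parses well-formed XML with the JDK SAX parser and returns the trace.
    static List<String> trace(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            SaxTrace handler = new SaxTrace();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        trace("<div><h1><a href=\"http://www.google.com\">GOOGLE!</a></h1></div>")
                .forEach(System.out::println);
    }
}
```

Fed by a parser that preserves nesting, the START line for <a> appears between the START and END lines for <h1>; in the Tika 1.5 trace above it appears after the END line for <h1>.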

Hope this makes the problem clearer.

Cheers,
Devarajan


On Mon, Sep 8, 2014 at 9:29 PM, Devaraja Swami <de...@gmail.com>
wrote:

> Hi Chris,
>
> Thanks for your reply.
>
> To ad more clarity to my original post, I expect that the Tika 1.5
> HtmlParser should parse any HTML input source and pass along the tags in
> the order appearing in the HTML source correctly to the downstream (user
> supplied) SAX content handler.
> This is not happening currently.
>
> For my HTML source, the Tika upstream parser (HtmlParser) that I call
> using the Tika API is sending the end tag [ endElement() ] of the enclosing
> <h1> tag to the (my) downstream content handler before it sends along the
> start tag [ startElement() ] of the enclosed <a> tag.
>
> IMHO, this is a clear, and quite serious, upstream parsing error.
>
> If possible, could you please shed some light on this, or explain how I
> can overcome this?
> If necessary, I can add a JIRA on this.
>
> Thanks,
> Devarajan
>
>
> On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann <ch...@gmail.com>
> wrote:
>
>> Hi Devarajan,
>>
>> Please see Chapter 5 of the Tika in Action book for more
>> detail on this. The short answer is that the parsed XHTML
>> representation of *any* upstream file does not necessarily
>> correspond to the upstream (X)HTML representation of the
>> file. The XHTML is an intermediate format that Tika uses
>> to represent the parsed structure content around the text.
>> That is, if you have the following scenario:
>>
>> PDF->XHTML->content handlers
>> XHTML->XHTML->content handlers
>> Word Doc->XHTML->content handlers
>> Image->XHTML-content handlers
>> ..
>> etc
>>
>> Note that XHTML intermediate is the structured representation
>> of the information around the text in the document (including
>> its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
>> for stream-based processing downstream.
>>
>> Cheers,
>> Chris
>>
>> ------------------------
>> Chris Mattmann
>> chris.mattmann@gmail.com
>>
>>
>>
>>
>> -----Original Message-----
>> From: Devaraja Swami <de...@gmail.com>
>> Reply-To: <us...@tika.apache.org>
>> Date: Monday, September 8, 2014 7:12 PM
>> To: <us...@tika.apache.org>
>> Subject: HTML parsing error with <a> tag inside <h1> tag
>>
>> >In the following HTML document, the <a> is inside the <h1> tag which is
>> >inside the <p> tag:
>> >-------------------
>> ><!DOCTYPE html>
>> ><html>
>> ><body>
>> >       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
>> ></body>
>> ></html>
>> >-------------------
>> >But when I parse it with Tika 1.5 HtmlParser,
>> >it adds both the <a> and <h1> tag nodes as direct children of the <p>
>> tag.
>> >
>> >The same error happens when I replace the <h1> tag with other header tags
>> ><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>> >[Haven't experimented with other replacements].
>> >
>> >This seems to be a basic issue.
>> >Any help would be deeply appreciated.
>> >
>> >Cheers,
>> >Devarajan

Re: HTML parsing error with <a> tag inside <h1> tag

Posted by Devaraja Swami <de...@gmail.com>.

Hi Chris,

Thanks for your reply.

To add more clarity to my original post, I expect that the Tika 1.5
HtmlParser should parse any HTML input source and pass along the tags in
the order appearing in the HTML source correctly to the downstream (user
supplied) SAX content handler.
This is not happening currently.

For my HTML source, the Tika upstream parser (HtmlParser) that I call using
the Tika API is sending the end tag [ endElement() ] of the enclosing <h1>
tag to the (my) downstream content handler before it sends along the start
tag [ startElement() ] of the enclosed <a> tag.

IMHO, this is a clear, and quite serious, upstream parsing error.

If possible, could you please shed some light on this, or explain how I can
overcome this?
If necessary, I can add a JIRA on this.

Thanks,
Devarajan

On Mon, Sep 8, 2014 at 8:55 PM, Chris Mattmann <ch...@gmail.com>
wrote:

> Hi Devarajan,
>
> Please see Chapter 5 of the Tika in Action book for more
> detail on this. The short answer is that the parsed XHTML
> representation of *any* upstream file does not necessarily
> correspond to the upstream (X)HTML representation of the
> file. The XHTML is an intermediate format that Tika uses
> to represent the parsed structure content around the text.
> That is, if you have the following scenario:
>
> PDF->XHTML->content handlers
> XHTML->XHTML->content handlers
> Word Doc->XHTML->content handlers
> Image->XHTML-content handlers
> ..
> etc
>
> Note that XHTML intermediate is the structured representation
> of the information around the text in the document (including
> its metadata). That XHTML is then passed into org.sax.xml.ContentHandlers
> for stream-based processing downstream.
>
> Cheers,
> Chris
>
> ------------------------
> Chris Mattmann
> chris.mattmann@gmail.com
>
>
>
>
> -----Original Message-----
> From: Devaraja Swami <de...@gmail.com>
> Reply-To: <us...@tika.apache.org>
> Date: Monday, September 8, 2014 7:12 PM
> To: <us...@tika.apache.org>
> Subject: HTML parsing error with <a> tag inside <h1> tag
>
> >In the following HTML document, the <a> is inside the <h1> tag which is
> >inside the <p> tag:
> >-------------------
> ><!DOCTYPE html>
> ><html>
> ><body>
> >       <div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
> ></body>
> ></html>
> >-------------------
> >But when I parse it with Tika 1.5 HtmlParser,
> >it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
> >
> >The same error happens when I replace the <h1> tag with other header tags
> ><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
> >[Haven't experimented with other replacements].
> >
> >This seems to be a basic issue.
> >Any help would be deeply appreciated.
> >
> >Cheers,
> >Devarajan

Re: HTML parsing error with <a> tag inside <h1> tag

Posted by Chris Mattmann <ch...@gmail.com>.

Hi Devarajan,

Please see Chapter 5 of the Tika in Action book for more
detail on this. The short answer is that the parsed XHTML
representation of *any* upstream file does not necessarily
correspond to the upstream (X)HTML representation of the
file. The XHTML is an intermediate format that Tika uses
to represent the parsed structure content around the text.
That is, if you have the following scenario:

PDF->XHTML->content handlers
XHTML->XHTML->content handlers
Word Doc->XHTML->content handlers
Image->XHTML->content handlers
..
etc

Note that the XHTML intermediate is the structured representation
of the information around the text in the document (including
its metadata). That XHTML is then passed into org.xml.sax.ContentHandler
implementations for stream-based processing downstream.

Cheers,
Chris

------------------------
Chris Mattmann
chris.mattmann@gmail.com




-----Original Message-----
From: Devaraja Swami <de...@gmail.com>
Reply-To: <us...@tika.apache.org>
Date: Monday, September 8, 2014 7:12 PM
To: <us...@tika.apache.org>
Subject: HTML parsing error with <a> tag inside <h1> tag

>In the following HTML document, the <a> is inside the <h1> tag which is
>inside the <p> tag:
>-------------------
><!DOCTYPE html>
><html>
><body>
>	<div><h1><a href="http://www.google.com">GOOGLE!</a></h1></div>
></body>
></html>
>-------------------
>But when I parse it with Tika 1.5 HtmlParser,
>it adds both the <a> and <h1> tag nodes as direct children of the <p> tag.
>
>The same error happens when I replace the <h1> tag with other header tags
><h2> ... <h5>, and/or the <p> tag with a <div> or <span> tag.
>[Haven't experimented with other replacements].
>
>This seems to be a basic issue.
>Any help would be deeply appreciated.
>
>Cheers,
>Devarajan