Posted to user@tika.apache.org by Jim Idle <ji...@proofpoint.com> on 2017/07/03 00:37:49 UTC

RE: HTML parsing, script tags,

Hi Tim,

I can create bug reports, but I looked into it via the debugger, and what is happening is that things like META and LINK in the <head> section are first gathered into maps and then used later to send the events (this is deep in tag soup, not Tika). This of course does not preserve order. Raising bugs would be a waste of someone’s time as they would just find, as I did, that this is how tag soup works. It also does not completely conform to HTML standards (by design).

The DOCTYPE absence was just that I needed to implement the lexical interface and handle startDTD. Unknown elements are ignored because, for some reason (not explained in the comments in Tika), tag soup is not thread safe if it is not told to ignore unknown tags. My bet is that it stores them in a data structure that is not thread safe.
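
To illustrate the startDTD point, here is a minimal sketch. It assumes the plain JDK SAX parser and a well-formed document, not Tika or tag soup, and the names DoctypeOrderDemo, RecordingHandler and record are made up for the example. The handler is registered both as the content handler and as the lexical handler, so the DOCTYPE and the startElement events are recorded in the order they arrive.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class; uses only the JDK's built-in SAX parser.
class DoctypeOrderDemo {

    /** Records DOCTYPE and element-start events in arrival order. */
    static class RecordingHandler extends DefaultHandler2 {
        final List<String> events = new ArrayList<>();

        // LexicalHandler callback: only delivered if the handler is
        // registered through the lexical-handler property below.
        @Override
        public void startDTD(String name, String publicId, String systemId) {
            events.add("doctype:" + name);
        }

        // ContentHandler callback for every element start.
        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            events.add("start:" + qName);
        }
    }

    /** Parses the given well-formed markup and returns the recorded events. */
    static List<String> record(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            RecordingHandler handler = new RecordingHandler();
            reader.setContentHandler(handler);
            // Standard SAX2 property for registering a LexicalHandler.
            reader.setProperty(
                "http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String doc = "<!DOCTYPE html><html><head><link rel=\"a\"/>"
                   + "<meta name=\"b\"/></head><body/></html>";
        System.out.println(record(doc));
    }
}
```

A handler written this way is not tied to one parser, so in principle it can be handed to any parser that emits SAX events and honours the lexical-handler property.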

I have switched back to validator.nu, which is kept up to date and conforms to SAX correctly. I think that it would be trivial to add this parser into Tika as an option (so as not to break existing code using tag soup), with a configurator. I don’t think that there is anything wrong with Tag Soup per se, as it is not really trying to be a parser like validator.nu. It is probably just fine for most things, unless you need structure order.

For now, I can do this because the same SAX handler works for both Tika and validator.nu with a few tweaks concerning accumulating metadata. That is in fact why I think it would be relatively easy to make validator.nu an option as the HTML parser for Tika.

Hope that helps,

Jim

From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Friday, June 30, 2017 23:13
To: user@tika.apache.org
Subject: RE: HTML parsing, script tags,

Wait, Tagsoup is not returning the start element events in the same order as the html?  I don’t think we can fix that or your other points, but would you be willing to share triggering documents and open an issue for each problem?

We should include those issues in our ongoing conversation about swapping out the underlying html parser for something more modern.

Sorry Tika isn’t working for you on this, and thank you!

From: Jim Idle [mailto:jidle@proofpoint.com]
Sent: Friday, June 30, 2017 1:23 AM
To: user@tika.apache.org
Subject: RE: HTML parsing, script tags,

Well I got a long way with the Tika wrapper around tag soup but then while chasing down a bug I realized that I was not getting the startElement events in the order that they are seen in the HTML file. It also ignores <!doctype> and unknown elements.

I can’t see any way to change that, and as knowing the structure of the document is very important, I will have to stop using Tika for HTML, I guess, and go back to validator.nu

Just posting this here for posterity really.

Jim

From: Ken Krugler [mailto:kkrugler_lists@transpac.com]
Sent: Wednesday, June 28, 2017 23:06
To: user@tika.apache.org
Subject: Re: HTML parsing, script tags,

Hi Jim,

On Jun 28, 2017, at 12:07am, Jim Idle <ji...@proofpoint.com> wrote:

So right now it looks like the HTML parser only sends through script tags if they have a src attribute. Is this likely to change or should I use another parser for HTML? I could submit a patch for this of course.

You can use a custom mapper if you want to alter which tags get passed through.

E.g. check out IdentityHtmlMapper in Tika for a mapper that passes through everything.


Also, does anyone have an opinion on whether the underlying tag soup stuff is tolerant of HTML (in a similar manner to browsers, which will try to render anything) or is expecting well-formed HTML? I can go look at the Tag Soup stuff directly of course, but just wondered if anyone has experience of using Tika to parse HTML.

TagSoup (and JSoup and NekoHTML) are all Java libraries that try to fix up broken HTML, with varying degrees of success, depending on the way that HTML is broken.

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr