Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2018/08/29 11:08:27 UTC

Attributes of HTML element not reported in ContentHandler

Hello,

We parse HTML using a ContentHandler. Tika uses TagSoup, which does not support modern HTML, but we work around the problem by fiddling with its HTMLSchema. We now have access to HTML5 elements, and to other curiosities such as allowing META anywhere in the body.

What we never managed to get working is reading the attributes of the HTML element. Does anyone have ideas on how to get these attributes reported every time?
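
To make the question concrete, here is a minimal sketch of the kind of handler involved. It is not our actual code: the class name HtmlAttributeCollector and the collect helper are hypothetical, I am assuming "the HTML element" means the root html tag, and the sketch uses the JDK's own SAX parser on well-formed XHTML instead of Tika and TagSoup. With plain SAX the attributes of html arrive in startElement as expected; it is this callback that never seems to carry them for us when the events come through Tika.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: remembers every attribute seen on the root <html> element.
class HtmlAttributeCollector extends DefaultHandler {
    private final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The JDK parser is not namespace-aware by default, so match on qName.
        if ("html".equalsIgnoreCase(qName)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    Map<String, String> getHtmlAttributes() {
        return htmlAttributes;
    }

    // Illustration only: parses well-formed XHTML with the JDK SAX parser, not with Tika.
    static Map<String, String> collect(String xhtml) throws Exception {
        HtmlAttributeCollector handler = new HtmlAttributeCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getHtmlAttributes();
    }
}
```

For input such as <html lang="nl" dir="ltr"><body/></html>, the map returned by collect holds both lang and dir. That is the behaviour we would like to see from the handler when it is driven by Tika.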

Many thanks,
Markus