Posted to user@tika.apache.org by Halil Ibrahim Simsek <ha...@simsek.email> on 2015/05/24 11:26:17 UTC

HTML5 Support

Hi,

I have applied to Google Summer of Code this year with the Apache Nutch project, to work on support for the HTML5 specification. As you know, Nutch uses NekoHTML (by default), TagSoup, and Tika for parsing HTML pages. What I would like to know is to what extent Tika supports the HTML5 specification, and which of its parsers are applicable to it. Are there any relevant issues on the JIRA tracker? Kindly advise me. Any help will be appreciated.

Thanks in advance.