Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2013/03/06 15:49:34 UTC

Javascript 'incorrectly' extracted as text

Hi,

In the following case the JavaScript is extracted as text:

<script language='JavaScript1.1' type='text/javascript' />
<!--
alert("blaat");
//-->
</script>

Strictly speaking this is correct behaviour, because the opening tag is self-closed and the script body therefore falls outside the element. But we all know the HTML is in error here: the opening tag is closed immediately even though a closing </script> follows. Modern browsers interpret the content as JavaScript, not text. Any hints on how we can get Tika/TagSoup to handle this case appropriately?
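
One workaround I can imagine is to rewrite the self-closed opening tag before the HTML reaches the parser. The following is only a minimal sketch of that idea, not something Tika or TagSoup provides: the class name SelfClosedScriptFixer and the method fix are made up for illustration, and the regex assumes attribute values contain no '>' character.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika or TagSoup): turns a self-closed
// <script ... /> opening tag into a normal <script ...> opening tag.
final class SelfClosedScriptFixer {

    // Matches "<script", any attributes, optional whitespace, then "/>".
    // Assumption: attribute values do not contain '>'.
    private static final Pattern SELF_CLOSED_SCRIPT =
            Pattern.compile("<script\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String fix(String html) {
        // Keep the attributes (group 1) and drop the trailing slash.
        return SELF_CLOSED_SCRIPT.matcher(html).replaceAll("<script$1>");
    }

    public static void main(String[] args) {
        String html = "<script language='JavaScript1.1' type='text/javascript' />\n"
                + "<!--\n"
                + "alert(\"blaat\");\n"
                + "//-->\n"
                + "</script>";
        // The fixed markup could then be handed to the parser as usual.
        System.out.println(fix(html));
    }
}
```

Note that this also rewrites a self-closed script tag that has no closing </script> at all, so everything after it would then be treated as script content. That matches what browsers do with such markup, but it may not be what we want in every case.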

Thanks
Markus