Posted to dev@tika.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2018/10/18 11:10:00 UTC

[jira] [Created] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

Markus Jelsma created TIKA-2759:
-----------------------------------

             Summary: ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
                 Key: TIKA-2759
                 URL: https://issues.apache.org/jira/browse/TIKA-2759
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.18
            Reporter: Markus Jelsma
             Fix For: 1.20


We extract JavaScript as text content, although it actually comes from a script tag with inline base64 data. The inline code is decoded and reported to the characters() method of our custom ContentHandler, so it ends up in the extracted text, but the script start tag itself never seems to be reported to startElement(). The JavaScript reaches characters() only after we have left the head and entered the body.
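
One way to observe this kind of behaviour is a small diagnostic handler that records which element is open whenever characters() is called. The sketch below is not the handler from this report: the class name ElementTrackingHandler and the helper trace() are hypothetical, and the demo uses the JDK's own SAX parser on well-formed markup rather than Tika, purely to show the event order one would expect (a start event for script before its text).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: logs element boundaries and tags each
// non-blank characters() call with the element that is currently open.
class ElementTrackingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.poll(); // poll() instead of pop() so unbalanced events do not throw
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            String parent = open.isEmpty() ? "(none)" : open.peek();
            events.add("chars@" + parent + ":" + text);
        }
    }

    // Assumption for the demo only: parse well-formed markup with the JDK
    // SAX parser. This is not how Tika produces its events.
    static List<String> trace(String xml) throws Exception {
        ElementTrackingHandler handler = new ElementTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }
}
```

Passing such a handler to a Tika parser in place of the JDK parser should show whether script text arrives under body without a preceding start event for script, which is the symptom described above.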

An HTML file that reproduces the problem is attached.