Posted to dev@tika.apache.org by "Markus Jelsma (JIRA)" <ji...@apache.org> on 2018/10/18 11:11:00 UTC

[jira] [Updated] (TIKA-2759) ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler

     [ https://issues.apache.org/jira/browse/TIKA-2759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated TIKA-2759:
--------------------------------
    Attachment: petrolicious.html

> ScriptsExtractor incorrectly reports Javascript to characters() in SAX ContentHandler
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-2759
>                 URL: https://issues.apache.org/jira/browse/TIKA-2759
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: petrolicious.html
>
>
> We extract JavaScript as text content, even though it actually comes from a script tag whose code is inlined as base64. The inline code is decoded and passed to the characters() method of our custom ContentHandler, so it ends up in the extracted text, yet the script start tag itself never seems to be reported to startElement(). The JavaScript is reported to characters() only after the parser has left the head and entered the body.
> The HTML file is attached.
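
One way to observe the behaviour described above is a ContentHandler that records which element is open whenever characters() fires. The following is only a minimal sketch under stated assumptions: the class name ElementTrackingHandler and the trace() helper are hypothetical, and the demo feeds a small well-formed XHTML string to the JDK's own SAX parser rather than running Tika on the attached file. With Tika, a handler like this would be passed to Parser.parse(stream, handler, metadata, context) instead.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: logs every SAX event together with the
// innermost element that was open when the event arrived.
public class ElementTrackingHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    private final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.push(qName);
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!open.isEmpty()) {
            open.pop();
        }
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // If script code shows up here under "body" (or any element other
            // than "script"), startElement() was never called for the script tag.
            String parent = open.isEmpty() ? "(none)" : open.peek();
            events.add("chars@" + parent + ":" + text);
        }
    }

    public List<String> events() {
        return events;
    }

    // Assumption: parses well-formed XHTML with the JDK SAX parser, not Tika.
    public static List<String> trace(String xml) throws Exception {
        ElementTrackingHandler handler = new ElementTrackingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>Hello</p></body></html>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

In a correct event stream, script text appears only in entries of the form chars@script. Script text attributed to body or another element would match the symptom reported in this issue.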



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)