Posted to user@tika.apache.org by Markus Jelsma <ma...@openindex.io> on 2018/10/17 14:24:13 UTC

Encoding issues when upgrading Tika 1.17 to 1.19.1

Hello,

I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran all 995 unit tests and observed three failures: two encoding issues and one other oddity. The tests use real HTML.

Where we previously extracted text such as 'Spokane, Wash. [— The solar', we now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could take ["weeks, or', but we now get 'could take [“weeks, or' extracted. Our tests pass with 1.17 but fail with 1.18 and 1.19.1.

The other test fails because we suddenly extract a bunch of JavaScript as text content, although it is actually a script tag with inline base64. This inline code is decoded and reported to the characters() method of our custom ContentHandler, so it ends up as extracted text, but it seems the script start tag itself is never reported to startElement(). The JavaScript is reported to characters() after we have left the head and entered the body.
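
To make the mechanism concrete, here is a minimal sketch of the kind of filtering a SAX handler typically does for scripts. It is an illustration only, not our actual handler: the class name ScriptSkippingHandler and the extract() helper are hypothetical, and it uses the JDK's built-in SAX parser on well-formed XHTML rather than Tika. The point is that this filter depends entirely on seeing startElement() for the script tag; if that event never arrives, the script body reaches characters() with the depth counter at zero and is appended as text.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text, skipping anything inside <script>.
class ScriptSkippingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int scriptDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without this event there is no way to know that script content follows.
        if ("script".equalsIgnoreCase(qName)) {
            scriptDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("script".equalsIgnoreCase(qName) && scriptDepth > 0) {
            scriptDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only keep text that is not nested inside a script element.
        if (scriptDepth == 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Parses a well-formed XHTML string with the JDK SAX parser (not Tika).
    public static String extract(String xhtml) {
        try {
            ScriptSkippingHandler handler = new ScriptSkippingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getText();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extract(xhtml)); // prints "Hello"
    }
}
```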

Any idea on how to fix this encoding issue and the weird inline base64 JavaScript? Are there any Tika options that I am unaware of? Are these bugs?

Of course, I can share the HTML files if needed.

Many thanks,
Markus

Re: Encoding issues when upgrading Tika 1.17 to 1.19.1

Posted by Tim Allison <ta...@apache.org>.
Hi Markus,

  On the scripts...we added an "extractScripts" option, but the
default is false, and the idea is that the scripts should be extracted
as embedded documents, which with xhtml, would be inlined.  But, with
the default as false, you shouldn't be seeing anything from scripts.

  On charset detection, that was likely caused by our "upgrade" to a
more recent copy of icu4j's charset detector.
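
  For illustration, the garbled text you describe is what UTF-8 bytes look like when they are decoded with a single-byte Western charset. The sketch below reproduces that with plain JDK calls and no Tika code; the class name MojibakeDemo and the choice of windows-1252 as the wrongly detected charset are assumptions made for the example, not something confirmed from your files.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a wrong charset guess garbles UTF-8 text.
class MojibakeDemo {
    // Encode text as UTF-8, then decode the bytes with the wrong charset,
    // which is what happens after a charset misdetection.
    public static String misdecode(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    // Undo the damage. This only works when the wrong charset is known and
    // every byte involved has a mapping in it.
    public static String repair(String garbled, Charset wrongCharset) {
        return new String(garbled.getBytes(wrongCharset), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed wrong guess
        String original = "Wash. \u2014 The solar";       // contains U+2014
        String garbled = misdecode(original, cp1252);
        // U+2014 is three bytes in UTF-8, so it becomes three characters here.
        System.out.println(garbled);
        System.out.println(repair(garbled, cp1252).equals(original)); // prints true
    }
}
```

  If your files show the same pattern, that would point to the detector's guess rather than the parser itself.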

  Thank you for letting us know about these.  Please do open issues
and share files.

   Cheers,

              Tim