Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/11/02 13:44:00 UTC

[jira] [Resolved] (TIKA-2485) HTMLEncodingDetector content limit to be configurable

     [ https://issues.apache.org/jira/browse/TIKA-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2485.
-------------------------------
    Resolution: Fixed

markLimit is now configurable in the HtmlEncodingDetector, the UniversalEncodingDetector and the Icu4jEncodingDetector.

Try something like this in tika-config.xml:

{noformat}
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
{noformat}
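With that configuration in place, the detectors can be exercised from Java by loading the config and asking it for the composite EncodingDetector. This is a minimal sketch, not part of the issue itself; it assumes Tika 1.17 (where TikaConfig.getEncodingDetector() is available) is on the classpath, and the file names "tika-config.xml" and "page.html" are placeholders.

{noformat}
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Paths;

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.EncodingDetector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;

public class DetectEncoding {
    public static void main(String[] args) throws Exception {
        // Load the tika-config.xml shown above; the <encodingDetectors>
        // element replaces the default detector chain, including the
        // configured markLimit values.
        TikaConfig config = new TikaConfig("tika-config.xml");
        EncodingDetector detector = config.getEncodingDetector();

        // Run detection on a local file (placeholder path). Each detector
        // in the chain may now read up to its configured markLimit before
        // giving up, instead of the old hard-coded limits.
        try (InputStream is = TikaInputStream.get(Paths.get("page.html"))) {
            Charset charset = detector.detect(is, new Metadata());
            System.out.println("Detected charset: " + charset);
        }
    }
}
{noformat}

If no <encodingDetectors> element is present, Tika falls back to its default detector chain with the built-in limits.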


> HTMLEncodingDetector content limit to be configurable
> -----------------------------------------------------
>
>                 Key: TIKA-2485
>                 URL: https://issues.apache.org/jira/browse/TIKA-2485
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector
>    Affects Versions: 1.16
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.17
>
>
> Tim's response to my question:
> -----Original message-----
> > From:Allison, Timothy B. <ta...@mitre.org>
> > Sent: Friday 27th October 2017 14:53
> > To: user@tika.apache.org
> > Subject: RE: Incorrect encoding detected
> > 
> > Hi Markus,
> >   
> > My guess is that the ~32,000 characters of mostly ASCII-ish <script/> are what is actually being used for encoding detection.  The HTMLEncodingDetector only looks in the first 8,192 characters, and the other encoding detectors have similar (but longer?) restrictions.
> >  
> > At some point, I had a dev version of a stripper that removed contents of <script/> and <style/> before trying to detect the encoding[0]...perhaps it is time to resurrect that code and integrate it?
> > 
> > Or, given that HTML has been, um, blossoming, perhaps, more simply, we should expand how far we look into a stream for detection?
> > 
> > Cheers,
> > 
> >                Tim
> > 
> > [0] https://issues.apache.org/jira/browse/TIKA-2038
> >    
> > 
> > -----Original Message-----
> > From: Markus Jelsma [mailto:markus.jelsma@openindex.io] 
> > Sent: Friday, October 27, 2017 8:39 AM
> > To: user@tika.apache.org
> > Subject: Incorrect encoding detected
> > 
> > Hello,
> > 
> > We have a problem with Tika, encoding and pages on this website: https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> > 
> > Using Nutch with Tika 1.12, and also with Tika 1.16, we found that the regular HTML parser does a fine job, but our TikaParser has a tough time with this HTML. For some reason Tika decides the page is Content-Encoding=windows-1252, even though the page properly identifies itself as UTF-8.
> > 
> > Of all websites we index, this is so far the only one giving trouble indexing accents, getting fÃ¥ instead of a regular få.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)