Posted to user@tika.apache.org by Dave French <df...@jsitelecom.com> on 2013/06/20 17:32:40 UTC

Html Parser autodetect charset

Hey,

In my use of Tika, I render a web page, take the contents of the page, and feed them into Tika.  The contents are encoded in UTF-8 when I feed them into Tika, but the HtmlParser uses the AutoDetectReader to try to determine the charset.  This means Tika is using the page's meta tag to determine the charset.

Is there a way to not use this AutoDetectReader and just specify the charset?  Or better yet, inject the Detector that will be used?

Thanks for your help,
Dave

RE: Html Parser autodetect charset

Posted by Dave French <df...@jsitelecom.com>.

Thanks Allison!  I am not using the tika-app.jar; instead, I created my own encoding detector and then updated the corresponding file in the tika-parsers jar.

Everything is working perfectly.

Thanks again!
Dave

From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Friday, June 21, 2013 11:38 AM
To: user@tika.apache.org
Cc: Dave French
Subject: RE: Html Parser autodetect charset

In the tika-app.jar, go to META-INF/services; there's a file (org.apache.tika.detect.EncodingDetector) that specifies the order in which the encoding detectors are applied.  The AutoDetectReader applies these in order and stops as soon as one of the detectors reports an encoding.

If you flip the order so that icu4j is first (as below), you should be set.

org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector

You could also create your own dummy EncodingDetector (always returns "UTF-8") and register it in the service file.


RE: Html Parser autodetect charset

Posted by "Allison, Timothy B." <ta...@mitre.org>.

In the tika-app.jar, go to META-INF/services; there's a file (org.apache.tika.detect.EncodingDetector) that specifies the order in which the encoding detectors are applied.  The AutoDetectReader applies these in order and stops as soon as one of the detectors reports an encoding.

If you flip the order so that icu4j is first (as below), you should be set.

org.apache.tika.parser.txt.Icu4jEncodingDetector
org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector

You could also create your own dummy EncodingDetector (always returns "UTF-8") and register it in the service file.
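
To illustrate why the order matters, here is a minimal, self-contained sketch of the "apply in order, stop at the first answer" behaviour described above, together with a dummy detector that always answers UTF-8. It does not use Tika itself: CharsetGuesser, MetaTagGuesser, AlwaysUtf8Guesser and firstMatch are hypothetical stand-ins invented for this example. A real detector would implement org.apache.tika.detect.EncodingDetector, which (as far as I know) works on an InputStream and a Metadata object rather than a byte array, so treat the signatures here as assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo driver: the same page gives different answers depending on detector order.
class EncodingChainSketch {

    // Ask each guesser in order and stop at the first non-null answer.
    static Charset firstMatch(List<CharsetGuesser> ordered, byte[] content, Charset fallback) {
        for (CharsetGuesser guesser : ordered) {
            Charset charset = guesser.guess(content);
            if (charset != null) {
                return charset;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // The bytes are UTF-8, but the markup still declares a different charset.
        byte[] page = "<meta charset=\"ISO-8859-1\"><p>caf\u00e9</p>"
                .getBytes(StandardCharsets.UTF_8);

        List<CharsetGuesser> metaFirst =
                Arrays.asList(new MetaTagGuesser(), new AlwaysUtf8Guesser());
        List<CharsetGuesser> utf8First =
                Arrays.asList(new AlwaysUtf8Guesser(), new MetaTagGuesser());

        System.out.println(firstMatch(metaFirst, page, StandardCharsets.US_ASCII)); // ISO-8859-1
        System.out.println(firstMatch(utf8First, page, StandardCharsets.US_ASCII)); // UTF-8
    }
}

// Hypothetical stand-in for an encoding detector (not Tika's interface).
interface CharsetGuesser {
    // Returns the guessed charset, or null when this guesser has no opinion.
    Charset guess(byte[] content);
}

// Dummy guesser: ignores the content and always answers UTF-8.
class AlwaysUtf8Guesser implements CharsetGuesser {
    @Override
    public Charset guess(byte[] content) {
        return StandardCharsets.UTF_8;
    }
}

// Toy meta-tag guesser: trusts whatever charset the markup declares.
class MetaTagGuesser implements CharsetGuesser {
    private static final Pattern DECLARED =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    @Override
    public Charset guess(byte[] content) {
        // ISO-8859-1 maps every byte to a char, so the ASCII markup stays searchable.
        String text = new String(content, StandardCharsets.ISO_8859_1);
        Matcher matcher = DECLARED.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            return null; // unknown or malformed charset name
        }
    }
}
```

With the meta-tag guesser first, the stale declaration wins; with the always-UTF-8 guesser first, the chain never reaches the meta tag. That is the effect of listing a custom detector at the top of the service file.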
