Posted to dev@tika.apache.org by Aleksandr Dubinsky <ad...@almson.net> on 2012/11/02 14:38:34 UTC

org.apache.tika.parser.txt.UniversalEncodingListener

I am having a problem with text files saved in Windows-1252 (or a similar
encoding) with LF line breaks. Bytes in the range 0x80 to 0x9F are being
returned as control codes.
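
To illustrate the symptom, here is a minimal sketch using only the JDK, not
Tika itself. The class name Cp1252Demo and its helper methods are made up for
this example. It decodes the same bytes once as windows-1252 and once as
ISO-8859-1 and checks whether C1 control characters (U+0080 to U+009F) show up
in the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tika.
public class Cp1252Demo {

    // Decode the given bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // True if the string contains any C1 control character (U+0080..U+009F).
    static boolean hasC1Controls(String s) {
        return s.chars().anyMatch(c -> c >= 0x80 && c <= 0x9F);
    }

    public static void main(String[] args) {
        // 0x93 and 0x94 are curly double quotes in windows-1252; 0x0A is LF.
        byte[] bytes = {(byte) 0x93, 'h', 'i', (byte) 0x94, 0x0A};

        String asCp1252 = decode(bytes, Charset.forName("windows-1252"));
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        // windows-1252 maps 0x93/0x94 to U+201C/U+201D (printable quotes).
        System.out.println("windows-1252 has C1 controls: " + hasC1Controls(asCp1252));
        // ISO-8859-1 maps every byte to the code point of the same value,
        // so 0x93/0x94 become the control characters U+0093/U+0094.
        System.out.println("ISO-8859-1 has C1 controls:   " + hasC1Controls(asLatin1));
    }
}
```

If a detector reports ISO-8859-1 for a file that is really windows-1252, any
byte in 0x80 to 0x9F will decode as in the second case.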

Question: why is this class second-guessing Mozilla's 1252 determination
and returning ISO 8859-1 (line 62)? What purpose does that serve?

Aleksandr Dubinsky
Almson Corp / x0x Source
98-10 64th Ave. Ste 3D
Rego Park, NY 11374
+1 (303) 800-4484

Re: org.apache.tika.parser.txt.UniversalEncodingListener

Posted by Ken Krugler <kk...@transpac.com>.
On Nov 2, 2012, at 6:38am, Aleksandr Dubinsky wrote:

> I am having a problem with text files saved in Windows-1252 (or a similar
> encoding) with LF line breaks. Bytes in the range 0x80 to 0x9F are being
> returned as control codes.
> 
> Question: why is this class second-guessing Mozilla's 1252 determination
> and returning ISO 8859-1 (line 62)? What purpose does that serve?

When you say "Mozilla's 1252 determination", where is that coming from and how is that being communicated to Tika?

Are you passing it in via the CONTENT_TYPE field in the Metadata?

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378