Posted to dev@poi.apache.org by bu...@apache.org on 2017/04/03 16:19:01 UTC

[Bug 50955] An error occurred while retrieving the text file.

https://bz.apache.org/bugzilla/show_bug.cgi?id=50955

--- Comment #11 from Tim Allison <ta...@mitre.org> ---
It turns out that 51944.doc is not UTF-16LE.  Judging from this file and 2
other files from our Common Crawl corpus, it is actually Big5, but MS appears
to zero-pad ASCII characters.

Has anyone worked with this?  Do we have something in our codebase that deals
with this already?

If not, we may need some extra code to imitate MS's Big5 en/decoding, but
that is not within the scope of this ticket.
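
To make the zero-padding idea concrete, here is a rough sketch of what such
a decoder might look like.  It is only a guess at the layout, not something
verified against the files: it assumes every character occupies two bytes,
that an ASCII character is stored as the ASCII byte followed by a 0x00 pad,
and that any other pair is a Big5 lead/trail byte pair in that order.  The
class and method names are made up for illustration and are not existing POI
code, and it assumes the JDK in use ships the "Big5" charset.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of POI.
public class ZeroPaddedBig5Sketch {

    // Assumption: text is stored in two-byte units. A unit whose second
    // byte is 0x00 is a zero-padded single-byte (ASCII) character; any
    // other unit is taken as a Big5 lead byte followed by its trail byte.
    public static String decode(byte[] data) {
        ByteArrayOutputStream compact = new ByteArrayOutputStream(data.length);
        int i = 0;
        for (; i + 1 < data.length; i += 2) {
            byte first = data[i];
            byte second = data[i + 1];
            compact.write(first);
            // Big5 trail bytes are never 0x00, so a zero second byte can
            // only be padding under the assumed layout; drop it.
            if (second != 0) {
                compact.write(second);
            }
        }
        // Keep an odd trailing byte rather than silently losing it.
        if (i < data.length) {
            compact.write(data[i]);
        }
        // With the padding removed, the bytes are decoded as ordinary Big5.
        return new String(compact.toByteArray(), Charset.forName("Big5"));
    }

    public static void main(String[] args) {
        // 'A', the Big5 pair 0xA4 0x40, then 'B', in the assumed layout.
        byte[] sample = { 'A', 0, (byte) 0xA4, (byte) 0x40, 'B', 0 };
        System.out.println(decode(sample));
    }
}
```

If the real files store the lead/trail bytes in the other order, or pad on
the other side, the loop would need to change accordingly; the sample bytes
above are invented and do not come from 51944.doc.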

Judging from ~1300 Word 6.0 files in our corpus, the proposed solution
works.  Unfortunately, only a few handfuls of those files aren't encoded
with WIN-1252.
