Posted to dev@poi.apache.org by bu...@apache.org on 2017/04/04 12:16:42 UTC

[Bug 60953] New: Improve Big5 handling for Word 6.0

https://bz.apache.org/bugzilla/show_bug.cgi?id=60953

            Bug ID: 60953
           Summary: Improve Big5 handling for Word 6.0
           Product: POI
           Version: 3.16-dev
          Hardware: PC
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: HWPF
          Assignee: dev@poi.apache.org
          Reporter: tallison@mitre.org
  Target Milestone: ---

Created attachment 34898
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34898&action=edit
Example bilingual English/Chinese Big5 Word 6.0 file

While working on Bug 50955, I found that MS had its own encoding of Big5,
which zero-pads ASCII characters.
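
To make the zero padding concrete, here is a minimal sketch of how padded
text might be normalized into a plain Big5 byte stream before decoding. It is
not the code referred to in this report. The class name PaddedBig5, the
method stripPadding, and above all the byte layout are assumptions for
illustration: the sketch assumes little-endian 16-bit units, so an ASCII
character is stored as its byte followed by 0x00 and a double-byte character
is stored trail byte first. The actual layout should be checked against the
attached sample files.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of POI.
final class PaddedBig5 {
    private PaddedBig5() {}

    /**
     * Converts two-bytes-per-character text into a plain Big5 byte stream.
     * ASSUMPTION: little-endian 16-bit units. A unit whose high byte is 0x00
     * is a zero-padded single-byte (ASCII) character; any other unit is a
     * double-byte character stored as trail byte, then lead byte.
     */
    static byte[] stripPadding(byte[] padded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(padded.length);
        int i = 0;
        for (; i + 1 < padded.length; i += 2) {
            int low = padded[i] & 0xFF;
            int high = padded[i + 1] & 0xFF;
            if (high == 0) {
                // Padded single-byte character: drop the 0x00 pad.
                out.write(low);
            } else {
                // Double-byte character: emit in lead, trail order.
                out.write(high);
                out.write(low);
            }
        }
        if (i < padded.length) {
            // Odd trailing byte: pass it through unchanged.
            out.write(padded[i] & 0xFF);
        }
        return out.toByteArray();
    }
}
```

The result could then be decoded with something like
new String(bytes, Charset.forName("Big5")), provided the running JDK ships
that charset. A Windows-specific charset such as code page 950 may be a
better fit for the 0xF9xx range, but that is untested here.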

I included some code that ought to be cleaned up.

An example of Big5 used to encode English is already in our set: Bug51944.doc.

Some notes will follow.

I'm also attaching a better bilingual Big5 English/Chinese example from Apache
Tika's Common Crawl corpus.

Many thanks, again, to Common Crawl, Dominik Stadler and Rackspace.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


[Bug 60953] Improve Big5 handling for Word 6.0

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60953

--- Comment #3 from Tim Allison <ta...@mitre.org> ---
Created attachment 34899
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=34899&action=edit
Another example file

This file comes from the same source as the other attachment.  It contains a
few 0xF9xx characters that the first example file does not.



[Bug 60953] Improve Big5 handling for Word 6.0

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60953

--- Comment #2 from Tim Allison <ta...@mitre.org> ---
Useful references:
https://en.wikipedia.org/wiki/Code_page_950 

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit950.txt



[Bug 60953] Improve Big5 handling for Word 6.0

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=60953

Tim Allison <ta...@mitre.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All

--- Comment #1 from Tim Allison <ta...@mitre.org> ---
It would also be handy if we could find some Shift-JIS examples.

Word95.doc has a Shift-JIS encoded font, but the text is all single-byte
English.  Given that we can't map from fonts to text pieces, it isn't clear to
me whether this is actually what Shift-JIS looks like or whether the English
is really Times New Roman.
