Posted to dev@poi.apache.org by bu...@apache.org on 2019/05/23 15:36:10 UTC

[Bug 63462] New: Problems with MAPIMessage.guess7BitEncoding/MAPIMessage.getHtmlBody

https://bz.apache.org/bugzilla/show_bug.cgi?id=63462

            Bug ID: 63462
           Summary: Problems with
                    MAPIMessage.guess7BitEncoding/MAPIMessage.getHtmlBody
           Product: POI
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: HSMF
          Assignee: dev@poi.apache.org
          Reporter: dominik.hoelzl@fabasoft.com
  Target Milestone: ---

Created attachment 36597
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=36597&action=edit
Example MSG files with different code pages

Some e-mails run into encoding problems when the subject, text body, or HTML
body is read and MAPIMessage.guess7BitEncoding is used.

Example: an e-mail defines PR_INTERNET_CPID -> UTF-8 and PR_MESSAGE_LOCALE_ID ->
1031, leaves PR_MESSAGE_CODEPAGE undefined, and has no headers.

* Outlook wants PR_SUBJECT to be CP1252, as PR_INTERNET_CPID applies only to
PR_BODY and PR_BODY_HTML (currently read as UTF-8, because guess7BitEncoding
sets this)
* Outlook wants binary PR_BODY_HTML to be UTF-8 (currently read as CP1252, as
getHtmlBody does not take any code page into account when the property is
binary)
* Outlook wants ASCII PR_BODY_HTML to be UTF-8 (currently correct)
* Outlook wants PR_BODY to be CP1252, for an unknown reason (currently read as
UTF-8, because guess7BitEncoding sets this)
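
The binary PR_BODY_HTML case can be reproduced outside POI. The following is
only a minimal sketch, not POI code and not the patch from the pull request:
the class and method names are hypothetical, only two code pages are mapped,
and the CP1252 fallback for a missing code page is an assumption that mirrors
the current behaviour described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of POI).
final class HtmlBodyDecodingSketch {

    // Maps a Windows code page number to a Java charset.
    // Only the two code pages discussed in this report are covered.
    static Charset charsetForCodePage(int codePage) {
        switch (codePage) {
            case 65001:
                return StandardCharsets.UTF_8;
            case 1252:
                return Charset.forName("windows-1252");
            default:
                throw new IllegalArgumentException("Unmapped code page: " + codePage);
        }
    }

    // Decodes the raw bytes of a binary PR_BODY_HTML property.
    // internetCpid is the value of PR_INTERNET_CPID, or null if undefined.
    // Assumption: fall back to CP1252 when no code page is given.
    static String decodeBinaryHtml(byte[] raw, Integer internetCpid) {
        int codePage = (internetCpid != null) ? internetCpid : 1252;
        return new String(raw, charsetForCodePage(codePage));
    }
}
```

With UTF-8 bytes for "Gr\u00fc\u00dfe", passing 65001 returns the original
text, while passing null decodes the same bytes as CP1252 and yields mojibake.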

According to the docs, PR_INTERNET_CPID may only be used to indicate the code
page for PR_BODY and PR_BODY_HTML:

https://docs.microsoft.com/en-us/office/client-developer/outlook/mapi/pidtaginternetcodepage-canonical-property

In my tests Outlook never looks at the charset information inside the HTML; it
only relies on PR_INTERNET_CPID.

If PR_MESSAGE_CODEPAGE is undefined and no headers are present, the default
ANSI code page for the locale defined by PR_MESSAGE_LOCALE_ID may be the only
hint for the correct code page, as PR_INTERNET_CPID applies only to the
text/HTML body.
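
The fallback order described above could be sketched as follows. This is an
illustration under stated assumptions, not the code from the pull request: the
class and method names are hypothetical, the locale table holds only a few
well-known entries, header-based detection is left out, and the final CP1252
default is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, for illustration only (not part of POI).
final class StringCodePageSketch {

    // Default ANSI code pages for a few Windows locale IDs (LCIDs).
    // Deliberately incomplete; a real table would cover many more locales.
    private static final Map<Integer, Integer> ANSI_CODE_PAGE_BY_LCID = new HashMap<>();
    static {
        ANSI_CODE_PAGE_BY_LCID.put(1031, 1252); // de-DE
        ANSI_CODE_PAGE_BY_LCID.put(1033, 1252); // en-US
        ANSI_CODE_PAGE_BY_LCID.put(1049, 1251); // ru-RU
        ANSI_CODE_PAGE_BY_LCID.put(1041, 932);  // ja-JP
    }

    // Picks the code page for string properties such as PR_SUBJECT.
    // PR_INTERNET_CPID is intentionally not consulted here.
    // Either argument may be null when the property is undefined.
    static int resolveStringCodePage(Integer messageCodePage, Integer messageLocaleId) {
        if (messageCodePage != null) {
            return messageCodePage; // PR_MESSAGE_CODEPAGE wins when defined
        }
        if (messageLocaleId != null) {
            Integer ansi = ANSI_CODE_PAGE_BY_LCID.get(messageLocaleId);
            if (ansi != null) {
                return ansi; // ANSI code page of PR_MESSAGE_LOCALE_ID
            }
        }
        return 1252; // assumed last-resort default
    }
}
```

For the example message (PR_MESSAGE_CODEPAGE undefined, PR_MESSAGE_LOCALE_ID
1031), resolveStringCodePage(null, 1031) returns 1252.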

Suggestion:

https://github.com/apache/poi/pull/149

(With this patch, all existing unit tests succeed without modification.)

Attachments:
MSG files whose text body and HTML body should be decoded correctly.
Outlook displays them as expected.

Regards,
Dominik
