Posted to mime4j-dev@james.apache.org by Markus Wiederkehr <ma...@gmail.com> on 2009/02/24 20:59:18 UTC

Re: [jira] Commented: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

On Tue, Feb 24, 2009 at 2:46 PM, Robert Burrell Donkin (JIRA)
<mi...@james.apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/MIME4J-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676270#action_12676270 ]
>
> Robert Burrell Donkin commented on MIME4J-118:
> ----------------------------------------------
>
> I suspect that there may be longer term issues with this general approach but i think we should accept that the current proposal is good enough for this release. release early, release often.

+1 on the release part but I need a few days to clean up that patch.

> I think that the best way to approach this is to preserve the original document together with boundary meta-data. In other words, record that a 'Content-Type' header starts at byte 99 in the document, rather than trying to slice up the document and re-assemble it from lots of small byte buffers. But this is related to other issues which should wait until after this release, so I think we should patch and look to ship.

We can cross that bridge when we come to it but I don't particularly
like the idea of having to open a file, seek to position 99 and read
50 bytes just to obtain the raw value of a Content-Type field, for
example.

Also, please bear in mind that Field instances may be shared between
multiple messages, and that they can be created from a constructor or
factory without an original document to back them up.

And last but not least, with nested encodings there is no meaningful
offset into a file.

Markus


>> MIME stream parser handles non-ASCII fields incorrectly
>> -------------------------------------------------------
>>
>>                 Key: MIME4J-118
>>                 URL: https://issues.apache.org/jira/browse/MIME4J-118
>>             Project: JAMES Mime4j
>>          Issue Type: Bug
>>            Reporter: Oleg Kalnichevski
>>            Assignee: Oleg Kalnichevski
>>             Fix For: 0.6
>>
>>         Attachments: mime4j-118-bytesequence-draft.patch, mime4j-118-field.patch, mimej4-118.patch
>>
>>
>> Presently the MIME stream parser handles non-ASCII fields incorrectly. Binary field content gets converted to its textual representation too early in the parsing process, using a simple byte-to-char cast. The decision about the appropriate char encoding should be left up to individual ContentHandler implementations.
>> Oleg
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Commented: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Markus Wiederkehr <ma...@gmail.com>.
On Tue, Feb 24, 2009 at 10:23 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> <snip>
>
> i worry about the quantity of copying and new buffers that will need
> to be created to store a single complex, large document when every
> component has to be stored as a string and also as bytes to ensure
> round tripping in non-compliant corner cases.

Well, at least I am confident that Mime4j does not perform worse than
it did before. Field always held the raw data; only now it's a
ByteSequence instead of a String. Both are immutable and are not
copied because of that.

Markus
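Markus's point about sharing immutable raw data can be sketched like this; the class below is an illustrative stand-in, not Mime4j's actual ByteSequence API:

```java
// Illustrative sketch only -- not Mime4j's actual ByteSequence.
// An immutable byte holder can be shared between messages without
// copying, as long as the constructor takes one defensive copy.
public final class RawField {
    private final byte[] raw; // raw bytes of the field

    public RawField(byte[] source) {
        // one defensive copy at construction; afterwards the instance
        // can be shared freely because it is never mutated
        this.raw = source.clone();
    }

    public int length() {
        return raw.length;
    }

    public byte byteAt(int index) {
        return raw[index];
    }

    public static void main(String[] args) {
        byte[] data = "Content-Type: text/plain"
                .getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        RawField field = new RawField(data);
        data[0] = 'X'; // mutating the source does not affect the field
        System.out.println((char) field.byteAt(0)); // prints 'C'
    }
}
```

Because the bytes are copied exactly once, at construction, the same instance can back fields in any number of messages without further copying.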

Re: [jira] Commented: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Stefano Bagnara <ap...@bago.org>.
Robert Burrell Donkin ha scritto:
> [...]
> IIRC in a multipart document, the mime headers must be encoded in
> ASCII. so, the first level headers can all be accessed through byte
> offsets. a part may contain a transfer encoded document. there are a
> couple of distinct cases which are interesting: when the document is
> an embedded message or an embedded multipart document. when this is
> encoded in Base64, a bytewise offset is not available in the
> original stream, but it is available in the decoded stream. so, the
> bytewise offset into the decoded stream can be used. this is a rare
> use case, so though the approach would be slow there, it would
> rarely matter.

FYI
http://issues.apache.org/jira/browse/MIME4J-114?focusedCommentId=12671463#action_12671463
-----
the content-transfer-encoding for a multipart message SHOULD always be
7bit, 8bit or binary to avoid nested encoding/decoding operations.

JavaMail from 1.4 ignores a content-transfer-encoding of quoted-printable
or base64 for a multipart message by default, while previous JavaMail
versions parsed nested encodings correctly. JavaMail 1.4 provides a flag
to re-enable the nested encodings (for backward compatibility):
mail.mime.ignoremultipartencoding
-----

Stefano
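The flag Stefano mentions is a System property, so it has to be set before the relevant JavaMail classes are loaded; a minimal sketch (property name from the mail above, default behaviour as described for JavaMail 1.4):

```java
public class MultipartEncodingFlag {
    public static void main(String[] args) {
        // JavaMail 1.4+ ignores quoted-printable/base64 on a multipart by
        // default; setting the flag to "false" restores the pre-1.4
        // behaviour of decoding nested encodings
        System.setProperty("mail.mime.ignoremultipartencoding", "false");
        System.out.println(
                System.getProperty("mail.mime.ignoremultipartencoding")); // prints "false"
    }
}
```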

Re: [jira] Commented: (MIME4J-118) MIME stream parser handles non-ASCII fields incorrectly

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Tue, Feb 24, 2009 at 7:59 PM, Markus Wiederkehr
<ma...@gmail.com> wrote:
> On Tue, Feb 24, 2009 at 2:46 PM, Robert Burrell Donkin (JIRA)
> <mi...@james.apache.org> wrote:
>>
>>    [ https://issues.apache.org/jira/browse/MIME4J-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676270#action_12676270 ]
>>
>> Robert Burrell Donkin commented on MIME4J-118:
>> ----------------------------------------------
>>
>> I suspect that there may be longer term issues with this general approach but i think we should accept that the current proposal is good enough for this release. release early, release often.
>
> +1 on the release part but I need a few days to clean up that patch.

fine

>> I think that the best way to approach this is to preserve the original document together with boundary meta-data. In other words, record that a 'Content-Type' header starts at byte 99 in the document, rather than trying to slice up the document and re-assemble it from lots of small byte buffers. But this is related to other issues which should wait until after this release, so I think we should patch and look to ship.
>
> We can cross that bridge when we come to it but I don't particularly
> like the idea of having to open a file, seek to position 99 and read
> 50 bytes just to obtain the raw value of a Content-Type field, for
> example.

nio manages this quite adequately ;-)

i worry about the quantity of copying and new buffers that will need
to be created to store a single complex, large document when every
component has to be stored as a string and also as bytes to ensure
round tripping in non-compliant corner cases. i would much rather
encourage users to retain the original when absolute fidelity is
required.
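The offset-based access being debated might look roughly like this with java.nio; the file layout, offsets, and lengths below are invented for illustration:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: read 50 bytes at offset 99 (here: 24 bytes at
// offset 21) from the original document without loading the rest of it.
public class FieldByOffset {
    static String readRaw(Path doc, long offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(doc, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            ch.read(buf, offset); // positional read; channel position unchanged
            buf.flip();
            return StandardCharsets.US_ASCII.decode(buf).toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("mime", ".eml");
        Files.write(tmp, "From: a@example.org\r\nContent-Type: text/plain\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII));
        // the Content-Type header starts right after the 21-byte From line
        System.out.println(readRaw(tmp, 21, 24)); // prints "Content-Type: text/plain"
        Files.delete(tmp);
    }
}
```

FileChannel's positional read never touches the channel's current position, so several readers could share one open original document without interfering.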

> Also, please bear in mind that Field instances may be shared between
> multiple messages, and that they can be created from a constructor or
> factory without an original document to back them up.

the difficult problems with round tripping should not occur when
fields are created programmatically

> And last but not least, with nested encodings there is no meaningful
> offset into a file.

i'm not sure i agree with that

IIRC in a multipart document, the mime headers must be encoded in
ASCII. so, the first level headers can all be accessed through byte
offsets. a part may contain a transfer encoded document. there are a
couple of distinct cases which are interesting: when the document is
an embedded message or an embedded multipart document. when this is
encoded in Base64, a bytewise offset is not available in the
original stream, but it is available in the decoded stream. so, the
bytewise offset into the decoded stream can be used. this is a rare
use case, so though the approach would be slow there, it would
rarely matter.

- robert
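
The decoded-stream offset robert describes could be sketched as follows; java.util.Base64 stands in for Mime4j's own codec, and the names and offsets are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of the decoded-stream offset idea: when an embedded message is
// base64-encoded, a byte offset is only meaningful against the decoded
// stream, so decode on the fly and skip to the offset in decoded bytes.
public class DecodedOffset {
    static String readDecoded(byte[] base64, int offset, int length) throws IOException {
        try (InputStream decoded =
                Base64.getMimeDecoder().wrap(new ByteArrayInputStream(base64))) {
            decoded.readNBytes(offset); // discard everything before the decoded offset
            return new String(decoded.readNBytes(length), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] enc = Base64.getEncoder().encode(
                "Subject: hello\r\n\r\nworld".getBytes(StandardCharsets.US_ASCII));
        // the Subject header occupies the first 14 decoded bytes
        System.out.println(readDecoded(enc, 0, 14)); // prints "Subject: hello"
    }
}
```

As robert notes, the decoding pass makes this slower than a direct seek, but only for the rare case of a transfer-encoded embedded message or multipart.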