Posted to server-dev@james.apache.org by Robert Burrell Donkin <ro...@gmail.com> on 2007/07/24 23:56:20 UTC

[mime4j] Please Review Cursor API

http://svn.apache.org/repos/asf/james/mime4j/trunk/src/main/java/org/apache/james/mime4j/Cursor.java
contains a first cut at a cursor API

comments and improvements welcomed :-)

a few particular points:

1. exception handling strategy - opted to throw IOExceptions almost
everywhere. when required, custom subclasses will be created.

2. there are currently some methods which seem like they may be
stateful. for example, it's not certain how to interpret
advanceToBoundary if boundary has not previously been set. the
question is whether to specify reasonable behaviour (for example, when
the boundary has not been set, advanceToBoundary should do nothing)
or insist that exceptions be thrown.

3. the API uses a string to represent the MIME boundary. i'm not sure
that this is right. AIUI (hopefully people will correct me if i'm
wrong) this can only be 8 bit ASCII characters. in general, passing a
string should mean worrying about encoding. realistically, the string
will just be stripped to its low order bytes.
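
for point 2, the stricter option (throw rather than silently do nothing)
might look roughly like the sketch below. all names here
(BoundaryNotSetException, StrictCursorSketch) are made up for illustration
and are not part of Cursor.java. the scan is deliberately naive: it ignores
the rule that a delimiter must start a line, and it assumes the boundary
string is US-ASCII only.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical IOException subclass: advanceToBoundary called too early. */
class BoundaryNotSetException extends IOException {
    BoundaryNotSetException() {
        super("advanceToBoundary called before a boundary was set");
    }
}

/** Illustrative stand-in only; this is not the mime4j Cursor interface. */
class StrictCursorSketch {
    private final byte[] data;
    private byte[] delimiter; // stays null until boundary(...) is called
    private int position;

    StrictCursorSketch(byte[] data) {
        this.data = data;
    }

    /** Sets the boundary; assumes the caller passes US-ASCII only. */
    void boundary(String boundary) {
        delimiter = ("--" + boundary).getBytes(StandardCharsets.US_ASCII);
    }

    int position() {
        return position;
    }

    /**
     * Moves just past the next delimiter and returns true, or moves to the
     * end of the data and returns false when no delimiter remains.
     */
    boolean advanceToBoundary() throws IOException {
        if (delimiter == null) {
            throw new BoundaryNotSetException(); // fail fast instead of doing nothing
        }
        for (int i = position; i + delimiter.length <= data.length; i++) {
            if (matchesAt(i)) {
                position = i + delimiter.length;
                return true;
            }
        }
        position = data.length;
        return false;
    }

    private boolean matchesAt(int offset) {
        for (int j = 0; j < delimiter.length; j++) {
            if (data[offset + j] != delimiter[j]) {
                return false;
            }
        }
        return true;
    }
}
```

because the exception is a checked IOException subclass, callers that
already handle IOException keep working, while callers that care can catch
the specific type.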

- robert

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Robert Burrell Donkin <ro...@gmail.com> wrote:
> On 7/25/07, Bernd Fondermann <bf...@brainlounge.de> wrote:
> > Robert Burrell Donkin wrote:
> > > http://svn.apache.org/repos/asf/james/mime4j/trunk/src/main/java/org/apache/james/mime4j/Cursor.java
> > >
> > > contains a first cut at a cursor API
> > >
> > > comments and improvements welcomed :-)
> >
> > +1
> >
> > one question (disclosure: I am not at all a MIME expert):
> > Is
> >    boolean moreMimeParts()
> > possible, without look ahead, e.g reading the full part first?
>
> the API is just factored out - same code, different interface so i'll
> need to take a look at the code...
>
> MimeBoundaryInputStream reads some way forward then pushes back. on
> closer inspection, not sure that moreMimeParts is very aptly named. it
> seems to be the negative of parent EOF.  moreMimeParts does seem a
> rather unreasonable and potentially expensive method for the API. i'll
> see if i can remove it safely...

it only needs to check the next two characters so it's not very expensive
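
a two character check of this kind is cheap because, per RFC 2046, a
delimiter followed immediately by "--" is the closing delimiter. the sketch
below assumes the stream is positioned just after a boundary delimiter and
was created with a pushback buffer of at least two bytes. the class and
method names are invented, and this is not the MimeBoundaryInputStream code.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

/** Illustrative only: peeks at two bytes and restores them. */
class LookaheadSketch {

    /**
     * Returns false at end of stream or when the next two bytes are "--"
     * (the closing delimiter marker); returns true otherwise. The bytes
     * that were read are pushed back, so the stream position is unchanged.
     */
    static boolean moreParts(PushbackInputStream in) throws IOException {
        int first = in.read();
        if (first == -1) {
            return false; // nothing left to read
        }
        int second = in.read();
        boolean closing = first == '-' && second == '-';
        // push back in reverse order so that 'first' is read first again
        if (second != -1) {
            in.unread(second);
        }
        in.unread(first);
        return !closing;
    }
}
```

usage would be along the lines of
new PushbackInputStream(source, 2) followed by LookaheadSketch.moreParts(in).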

in the current form of the API, it would be difficult to remove

- robert

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Bernd Fondermann <bf...@brainlounge.de> wrote:
> Robert Burrell Donkin wrote:
> > http://svn.apache.org/repos/asf/james/mime4j/trunk/src/main/java/org/apache/james/mime4j/Cursor.java
> >
> > contains a first cut at a cursor API
> >
> > comments and improvements welcomed :-)
>
> +1
>
> one question (disclosure: I am not at all a MIME expert):
> Is
>    boolean moreMimeParts()
> possible, without look ahead, e.g reading the full part first?

the API is just factored out - same code, different interface so i'll
need to take a look at the code...

MimeBoundaryInputStream reads some way forward then pushes back. on
closer inspection, not sure that moreMimeParts is very aptly named. it
seems to be the negative of parent EOF.  moreMimeParts does seem a
rather unreasonable and potentially expensive method for the API. i'll
see if i can remove it safely...

- robert

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 8/6/07, Jochen Wiedmann <jo...@gmail.com> wrote:
> Bernd Fondermann wrote:
> >
> > one question (disclosure: I am not at all a MIME expert):
> > Is
> >    boolean moreMimeParts()
> > possible, without look ahead, e.g reading the full part first?
> >
>
> No, it isn't.

an optimised read ahead would be possible

(this is the performance area highlighted by andrew c oliver)

- robert

Re: [mime4j] Please Review Cursor API

Posted by Jochen Wiedmann <jo...@gmail.com>.



Bernd Fondermann wrote:
> 
> one question (disclosure: I am not at all a MIME expert):
> Is
>    boolean moreMimeParts()
> possible, without look ahead, e.g reading the full part first?
> 

No, it isn't.

Re: [mime4j] Please Review Cursor API

Posted by Bernd Fondermann <bf...@brainlounge.de>.

Robert Burrell Donkin wrote:
> http://svn.apache.org/repos/asf/james/mime4j/trunk/src/main/java/org/apache/james/mime4j/Cursor.java 
> 
> contains a first cut at a cursor API
> 
> comments and improvements welcomed :-)

+1

one question (disclosure: I am not at all a MIME expert):
Is
   boolean moreMimeParts()
possible, without look ahead, e.g reading the full part first?

   Bernd


Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin ha scritto:
> > On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> >> I'd go for an exception. But I don't know the code enough to understand
> >> how likely this will happen and how likely this is a programmer error or
> >> something else.
> >
> > AFAICT it would be an implementation error
> >
> > state is maintained in both the pull parser and the cursor
> >
> > the cursor needs to understand whether it is within a part in a mime
> > message or not, since the input stream reads only within a part.
> > the pull parser also records this information.
> >
> > would probably be cleaner to maintain this in one place. ideas welcomed.
>
> what about adding a Cursor.isInMimePart() or something similar?

a possible alternative would be to rework cursor as a minimal pull parser

the boundaries of each header would be located but the contents not
parsed (except for the mime information that the parser needs)

the boundaries of each part would be located but the contents not parsed

this might be useful in general and would be a cleaner API at the
price of each cursor implementation requiring more intelligence
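
to make "located but not parsed" concrete, here is a rough sketch that only
records the byte offsets of each header field and stops at the blank line
that ends the header block. it assumes CRLF line endings and treats a line
starting with space or tab as a continuation of the previous field. the
class and method names are hypothetical and nothing here is taken from
mime4j.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: finds where header fields are, without parsing them. */
class HeaderLocatorSketch {

    /**
     * Returns one {start, end} pair per header field. 'end' is exclusive
     * and does not include the terminating CRLF. Folded continuation
     * lines are included in the field they belong to.
     */
    static List<int[]> locateHeaders(byte[] message) {
        List<int[]> fields = new ArrayList<>();
        int fieldStart = -1; // -1 means no field is currently open
        int fieldEnd = -1;
        int lineStart = 0;
        while (lineStart < message.length) {
            int lineEnd = indexOfCrlf(message, lineStart);
            if (lineEnd == -1) {
                lineEnd = message.length; // unterminated last line
            }
            boolean empty = lineEnd == lineStart;
            boolean continuation = !empty
                    && (message[lineStart] == ' ' || message[lineStart] == '\t');
            if (continuation && fieldStart != -1) {
                fieldEnd = lineEnd; // folded line extends the open field
            } else {
                if (fieldStart != -1) {
                    fields.add(new int[] {fieldStart, fieldEnd});
                }
                if (empty) {
                    return fields; // blank line: end of the header block
                }
                fieldStart = lineStart;
                fieldEnd = lineEnd;
            }
            lineStart = lineEnd + 2; // skip past CRLF
        }
        if (fieldStart != -1) {
            fields.add(new int[] {fieldStart, fieldEnd});
        }
        return fields;
    }

    private static int indexOfCrlf(byte[] data, int from) {
        for (int i = from; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

a caller that needs the mime information could then decode only the fields
it cares about and leave the rest as raw offsets.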

opinions?

- robert

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> I understand this is a long message and I asked many questions, so: if
> this does not help your research just ignore it and go ahead with your
> ideas. I'll review the code ;-)

it's good to talk things through (anyone who isn't interested will
probably have stopped reading this thread by now)

> >> Not sure I understand the problem. Can't we ignore the encoding issue,
> >> at all? The important thing is that the API uses a string and a string
> >> always can contain a 7bit sequence in a lossless way. If you write such
> >> string to bytes using the US-ASCII charset the result will be unchanged,
> >> right?
> >
> > if the string contains only US-ASCII then yes, the transformation will
> > be lossless
>
> Well, the String object is only a "container" large enough for our
> purpose. In OOP we often use an Integer to pass data that should be a
> subset of an integer. The important fact is that the meaning of the
> data we want to transfer is kept.
> That's why we can use the string and simply do a parameter check to see
> it is really a US-ASCII only sequence or we can use anything else. IMHO
> the choice does not depend on the charset support of the String object,
> but on the ease of use. You are developing the API, you are more entitled
> to decide whether a byte[] is better than String.

designing good APIs is too hard to be left to one developer

> > my point is that by including a string in the API the caller is forced
> > to decode the natural representation (bytes) to a string which will
> > then be encoded to bytes by the cursor implementation. this approach
> > seems wrong to me.
>
> Well, bytes are the natural representation for every information we
> manage in IT ;-)
>
> My point is that Strings have very convenient methods and they are really
> well optimized in the JVM, so maybe sometimes String handling is not much
> worse than manual byte handling, but Strings are more usable than byte-arrays.

depends on how the caller has the data

> FWIW you can also introduce a "Boundary" object so that implementation
> can be optimized without altering the API.

or introduce a helper method for CharSequence
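
one way the two suggestions could be combined is a small value object that
stores bytes internally and offers factory methods for both byte[] and
CharSequence. this is only a sketch: BoundarySketch and its methods are
invented names, and IllegalArgumentException is used here purely for
brevity (which exception type to use is still open in this thread).

```java
/** Illustrative only: an immutable holder for an ASCII boundary. */
final class BoundarySketch {
    private final byte[] bytes;

    private BoundarySketch(byte[] bytes) {
        this.bytes = bytes;
    }

    /** For callers that already hold the boundary as raw bytes. */
    static BoundarySketch of(byte[] raw) {
        for (int i = 0; i < raw.length; i++) {
            if (raw[i] < 0) { // a negative byte has its 8th bit set
                throw new IllegalArgumentException("non-ASCII byte at index " + i);
            }
        }
        return new BoundarySketch(raw.clone());
    }

    /** Helper for callers that hold the boundary as text. */
    static BoundarySketch of(CharSequence text) {
        byte[] raw = new byte[text.length()];
        for (int i = 0; i < raw.length; i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("non-ASCII char at index " + i);
            }
            raw[i] = (byte) c; // safe narrowing: c is in the 7 bit range
        }
        return new BoundarySketch(raw);
    }

    /** Returns a copy so that the stored boundary cannot be modified. */
    byte[] toBytes() {
        return bytes.clone();
    }
}
```

with something like this the cursor method could accept the object, and
the internal representation could change later without touching the API.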

> >> (if you had non US-ASCII they will be instead converted to "?").
> >
> > that depends on the way the encoding is done
> >
> > String.getBytes() is JVM and charset dependent
>
> shouldn't getBytes("US-ASCII") work always fine for a String including
> 7bit only chars and use "?" for chars outside the 7bit ?

no - the javadocs state that the behaviour is unspecified when the string
cannot be encoded in the given charset

for MIME boundaries, IMHO the right behaviour would be to throw an
exception (rather than converting) so this means using the more
reliable nio charset encoders
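
a minimal sketch of that approach using java.nio.charset.CharsetEncoder
with CodingErrorAction.REPORT, so that a non US-ASCII character causes a
CharacterCodingException (an IOException subclass) instead of being
replaced. the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Illustrative only: strict US-ASCII encoding of a boundary. */
class StrictAsciiSketch {

    /** Encodes the boundary, failing on any character outside US-ASCII. */
    static byte[] encodeBoundary(CharSequence boundary)
            throws CharacterCodingException {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer encoded = encoder.encode(CharBuffer.wrap(boundary));
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes); // copy out exactly the encoded bytes
        return bytes;
    }
}
```

swapping REPORT for IGNORE or REPLACE gives the other two behaviours
without changing the rest of the code.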

> > using the more flexible nio encoders, then bad characters can be
> > reported, ignored or replaced
>
> Not sure I understand this point: do we need to recognize/ignore/replace
> bad chars in the Boundary wrt that API call?

the more flexible nio charset encoders allow the conversion behaviour to
be set programmatically

> >> The only problems are when we try to use non US-ASCII chars as a
> >> boundary, but this should not be allowed as it is an illegal argument:
> >> maybe we may want to check this in the
> >> public void boundary(String boundary) throws IOException. Maybe throwing
> >> a new IllegalArgumentException on a boundary including non US-ASCII
> >> chars is enough (maybe a check for "?" presence is enough).
> >
> > throwing an exception does seem reasonable
> >
> > i prefer to offer subclasses for cases such as this so that they can
> > be caught and (perhaps) dealt with
> >
> > i generally prefer checked to runtime exceptions but perhaps an
> > IOException may be wrong here
>
> IMHO the specific check is an argument validity check and an
> IllegalArgumentException better fits in. I see IOException more related
> to IO problems and not related to content/argument.
> Btw I'm also fine with IOException, and as you are the one with the
> dirty hands now, you should decide, IMHO ;-)

throwing runtime exceptions has downsides when running in many containers

> >> Passing byte
> >> sequences IMHO would not solve the issue as you would have to check the
> >> 8th bit anyway.
> >
> > true but the check is much quicker and the failure more precise
>
> I agree. It is a tradeoff of ease of use vs speed/precision. In my
> understanding we didn't need *that* speed and precision for the
> boundary, but I don't know exactly what code you're talking about, so
> I'm fine with the low level operations too.

there are existing performance worries about mime4j and i'd like to
try to avoid baking any more into the API (if possible)

- robert

Re: [mime4j] Please Review Cursor API

Posted by Stefano Bagnara <ap...@bago.org>.

I understand this is a long message and I asked many questions, so: if
this does not help your research just ignore it and go ahead with your
ideas. I'll review the code ;-)

--

Robert Burrell Donkin ha scritto:
> On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
>> what about adding a Cursor.isInMimePart() or something similar?
> 
> not sure it would be so simple as that. the cursor would probably need
> to become a first pass parser.
> 
> the cursor would need to perform basic parsing of the email to find
> the appropriate mime headers and so the appropriate boundary. it would
> be possible to model the API so that the cursor performed basic
> non-recursive pull parsing (header lines, parts but not part headers).

You lost me here. I don't have enough understanding of what we are
talking about to bring more useful hints. I'll wait for the code to
review ;-)

>> Not sure I understand the problem. Can't we ignore the encoding issue,
>> at all? The important thing is that the API uses a string and a string
>> always can contain a 7bit sequence in a lossless way. If you write such
>> string to bytes using the US-ASCII charset the result will be unchanged,
>> right?
> 
> if the string contains only US-ASCII then yes, the transformation will
> be lossless

Well, the String object is only a "container" large enough for our
purpose. In OOP we often use an Integer to pass data that should be a
subset of an integer. The important fact is that the meaning of the
data we want to transfer is kept.
That's why we can use the string and simply do a parameter check to see
it is really a US-ASCII only sequence or we can use anything else. IMHO
the choice does not depend on the charset support of the String object,
but on the ease of use. You are developing the API, you are more entitled
to decide whether a byte[] is better than String.

> my point is that by including a string in the API the caller is forced
> to decode the natural representation (bytes) to a string which will
> then be encoded to bytes by the cursor implementation. this approach
> seems wrong to me.

Well, bytes are the natural representation for every information we
manage in IT ;-)

My point is that Strings have very convenient methods and they are really
well optimized in the JVM, so maybe sometimes String handling is not much
worse than manual byte handling, but Strings are more usable than byte-arrays.

FWIW you can also introduce a "Boundary" object so that implementation
can be optimized without altering the API.

>> (if you had non US-ASCII they will be instead converted to "?").
> 
> that depends on the way the encoding is done
> 
> String.getBytes() is JVM and charset dependent

shouldn't getBytes("US-ASCII") work always fine for a String including
7bit only chars and use "?" for chars outside the 7bit ?

> using the more flexible nio encoders, then bad characters can be
> reported, ignored or replaced

Not sure I understand this point: do we need to recognize/ignore/replace
bad chars in the Boundary wrt that API call?

>> The only problems are when we try to use non US-ASCII chars as a
>> boundary, but this should not be allowed as it is an illegal argument:
>> maybe we may want to check this in the
>> public void boundary(String boundary) throws IOException. Maybe throwing
>> a new IllegalArgumentException on a boundary including non US-ASCII
>> chars is enough (maybe a check for "?" presence is enough).
> 
> throwing an exception does seem reasonable
> 
> i prefer to offer subclasses for cases such as this so that they can
> be caught and (perhaps) dealt with
> 
> i generally prefer checked to runtime exceptions but perhaps an
> IOException may be wrong here

IMHO the specific check is an argument validity check and an
IllegalArgumentException better fits in. I see IOException more related
to IO problems and not related to content/argument.
Btw I'm also fine with IOException, and as you are the one with the
dirty hands now, you should decide, IMHO ;-)

>> Passing byte
>> sequences IMHO would not solve the issue as you would have to check the
>> 8th bit anyway.
> 
> true but the check is much quicker and the failure more precise

I agree. It is a tradeoff of ease of use vs speed/precision. In my
understanding we didn't need *that* speed and precision for the
boundary, but I don't know exactly what code you're talking about, so
I'm fine with the low level operations too.

> there are various ways that an encoding might fail and there would be
> effort involved in determining the exact cause

Well, as far as I can tell from CharToByteASCII.convert sources there
are no failures involved (if you don't pass wrong buffer sizes, and this
should not happen using String functions).

>> The details depend mainly on the usage of the boundary by the
>> underlying system: if the system works with bytes then maybe it is ok to
>> use bytes also for the boundary method, otherwise IMHO it's safe to keep
>> using the String (and maybe add the argument check).
> 
> MIME works with 8-bit bytes not 16-bit UNICODE so bytes are the
> natural way of representing boundaries in java
> 
> - robert

I don't agree with your interpretation of MIME working with 8bit bytes:
to be more precise, I agree that a byte contains 8 bits ;-) . UNICODE
and bytes are not things we can compare and treat as alternatives.

UNICODE is a way to represent chars using bits. We can compare single
byte Chars with 2 byte chars, but not UNICODE vs bytes. MIME does not
work with 8bit bytes more than any other PC related specification.

Maybe I'm not understanding your point at all, that's why I keep trying
to give you details on my mis/understanding.

As I said previously everything in IT is mapped by bits and bytes
(because of the available hardware) but MIME is just another thing that
we represent with bytes: MIME defines what a line is, what CRLF is, what
7bit data is, what 8bit data is, what binary data is. Working directly
with bytes IMHO does not mean you use the RIGHT way to represent MIME
data, it simply means working in a low-level raw byte representation
without any further abstraction.

Are you going to write your own CharSequence implementation/wrapper for
everything so as to avoid the memory abuse of Java's UNICODE based Strings?
Or maybe a "CompactString" that is able to wrap a bytebuffer or a
CharSequence and can convert to/from them (but, unlike Java, does
not convert them to UNICODE by default) would simply do the trick?

To be sure you're not misunderstanding me let me repeat I'm not against
your approach (I don't even understand it, so I cannot be against it). I
just want to understand what problem you're trying to solve and how you
do propose to solve it.

Are we still discussing the "boundary" or do these concerns also (or
only) apply to MimePart contents?

Maybe a corner case I very often see discussed in MIME related mailing
lists adds something to this discussion: some non-RFC-compliant clients
simply put 8bit chars in the header values and encode that data using
the same encoding as the MIME body. This is not compliant behaviour but
some MIME applications try to understand/recover from this case and
no server I'm aware of simply rejects the message as being non compliant
(even if this seems the only compliant option from reading the RFC): what is
our position wrt this issue? Can our position influence the decisions
about the way we parse and move around this data?

Stefano

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin ha scritto:
> > On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> >> I'd go for an exception. But I don't know the code enough to understand
> >> how likely this will happen and how likely this is a programmer error or
> >> something else.
> >
> > AFAICT it would be an implementation error
> >
> > state is maintained in both the pull parser and the cursor
> >
> > the cursor needs to understand whether it is within a part in a mime
> > message or not, since the input stream reads only within a part.
> > the pull parser also records this information.
> >
> > would probably be cleaner to maintain this in one place. ideas welcomed.
>
> what about adding a Cursor.isInMimePart() or something similar?

not sure it would be so simple as that. the cursor would probably need
to become a first pass parser.

the cursor would need to perform basic parsing of the email to find
the appropriate mime headers and so the appropriate boundary. it would
be possible to model the API so that the cursor performed basic
non-recursive pull parsing (header lines, parts but not part headers).

> >> > 3. the API uses a string to represent the MIME boundary. i'm not sure
> >> > that this is right. AIUI (hopefully people will correct me if i'm
> >> > wrong) this can only be 8 bit ASCII characters. in general, passing a
> >> > string should mean worrying about encoding. realistically, the string
> >> > will just be stripped to its low order bytes.
> >> >
> >> > - robert
> >>
> >> Why 8 bit ASCII ? Shouldn't it be 7 bit ASCII? The first 7 bit of the
> >> US-ASCII should be present in every encoding, right?
> >
> > sorry: forgot that 7-bit, 8-bit has special meaning in the email context
> >
> > AIUI the boundary consists of ASCII each encoded as one 8-bit byte
> > with one clean bit. java strings (and chars) are UNICODE. this is
> > usually encoded as two 8-bit bytes (no clean bits), one 16-bit byte
> > (no clean bits) or variable (one, two or three) 8-bit bytes.
> >
> > accepting a string might require a byte in the input to be decoded to
> > a char then encoded to a byte to be used to compare the boundary.
> >
> > an alternative strategy would be to push enough intelligence into the
> > cursor for it to be able to work out MIME and header boundaries for
> > itself.
> >
> > - robert
>
> Not sure I understand the problem. Can't we ignore the encoding issue,
> at all? The important thing is that the API uses a string and a string
> always can contain a 7bit sequence in a lossless way. If you write such
> string to bytes using the US-ASCII charset the result will be unchanged,
> right?

if the string contains only US-ASCII then yes, the transformation will
be lossless

my point is that by including a string in the API the caller is forced
to decode the natural representation (bytes) to a string which will
then be encoded to bytes by the cursor implementation. this approach
seems wrong to me.

> (if you had non US-ASCII they will be instead converted to "?").

that depends on the way the encoding is done

String.getBytes() is JVM and charset dependent

using the more flexible nio encoders, then bad characters can be
reported, ignored or replaced

> The only problems are when we try to use non US-ASCII chars as a
> boundary, but this should not be allowed as it is an illegal argument:
> maybe we may want to check this in the
> public void boundary(String boundary) throws IOException. Maybe throwing
> a new IllegalArgumentException on a boundary including non US-ASCII
> chars is enough (maybe a check for "?" presence is enough).

throwing an exception does seem reasonable

i prefer to offer subclasses for cases such as this so that they can
be caught and (perhaps) dealt with

i generally prefer checked to runtime exceptions but perhaps an
IOException may be wrong here

> Passing byte
> sequences IMHO would not solve the issue as you would have to check the
> 8th bit anyway.

true but the check is much quicker and the failure more precise

there are various ways that an encoding might fail and there would be
effort involved in determining the exact cause
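
for comparison, the byte level check amounts to a single mask per byte, and
the index of the offending byte comes for free. a minimal sketch, with
invented class and method names:

```java
/** Illustrative only: checks a boundary held as raw bytes. */
class SevenBitCheckSketch {

    /**
     * Returns the index of the first byte with its 8th bit set,
     * or -1 when every byte is within the 7 bit US-ASCII range.
     */
    static int firstNonAscii(byte[] boundary) {
        for (int i = 0; i < boundary.length; i++) {
            if ((boundary[i] & 0x80) != 0) {
                return i;
            }
        }
        return -1;
    }
}
```

a caller could turn a non-negative result into whatever exception the API
settles on, including the index in the message.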

> The details depend mainly on the usage of the boundary by the
> underlying system: if the system works with bytes then maybe it is ok to
> use bytes also for the boundary method, otherwise IMHO it's safe to keep
> using the String (and maybe add the argument check).

MIME works with 8-bit bytes not 16-bit UNICODE so bytes are the
natural way of representing boundaries in java

- robert

Re: [mime4j] Please Review Cursor API

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin ha scritto:
> On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
>> I'd go for an exception. But I don't know the code enough to understand
>> how likely this will happen and how likely this is a programmer error or
>> something else.
> 
> AFAICT it would be an implementation error
> 
> state is maintained in both the pull parser and the cursor
> 
> the cursor needs to understand whether it is within a part in a mime
> message or not, since the input stream reads only within a part.
> the pull parser also records this information.
>
> would probably be cleaner to maintain this in one place. ideas welcomed.

what about adding a Cursor.isInMimePart() or something similar?

>> > 3. the API uses a string to represent the MIME boundary. i'm not sure
>> > that this is right. AIUI (hopefully people will correct me if i'm
>> > wrong) this can only be 8 bit ASCII characters. in general, passing a
>> > string should mean worrying about encoding. realistically, the string
>> > will just be stripped to its low order bytes.
>> >
>> > - robert
>>
>> Why 8 bit ASCII ? Shouldn't it be 7 bit ASCII? The first 7 bit of the
>> US-ASCII should be present in every encoding, right?
> 
> sorry: forgot that 7-bit, 8-bit has special meaning in the email context
> 
> AIUI the boundary consists of ASCII each encoded as one 8-bit byte
> with one clean bit. java strings (and chars) are UNICODE. this is
> usually encoded as two 8-bit bytes (no clean bits), one 16-bit byte
> (no clean bits) or variable (one, two or three) 8-bit bytes.
> 
> accepting a string might require a byte in the input to be decoded to
> a char then encoded to a byte to be used to compare the boundary.
> 
> an alternative strategy would be to push enough intelligence into the
> cursor for it to be able to work out MIME and header boundaries for
> itself.
> 
> - robert

Not sure I understand the problem. Can't we ignore the encoding issue,
at all? The important thing is that the API uses a string and a string
always can contain a 7bit sequence in a lossless way. If you write such
string to bytes using the US-ASCII charset the result will be unchanged,
right? (if you had non US-ASCII they will be instead converted to "?").

The only problems are when we try to use non US-ASCII chars as a
boundary, but this should not be allowed as it is an illegal argument:
maybe we may want to check this in the
public void boundary(String boundary) throws IOException. Maybe throwing
a new IllegalArgumentException on a boundary including non US-ASCII
chars is enough (maybe a check for "?" presence is enough). Passing byte
sequences IMHO would not solve the issue as you would have to check the
8th bit anyway.

The details depend mainly on the usage of the boundary by the
underlying system: if the system works with bytes then maybe it is ok to
use bytes also for the boundary method, otherwise IMHO it's safe to keep
using the String (and maybe add the argument check).

Stefano

Re: [mime4j] Please Review Cursor API

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On 7/25/07, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin ha scritto:

<snip>

> > 2. there are currently some methods which seem like they may be
> > stateful. for example, it's not certain how to interpret
> > advanceToBoundary if boundary has not previously been set. the
> > question is whether to specify reasonable behaviour (for example, when
> > the boundary has not been set, advanceToBoundary should do nothing)
> > or insist that exceptions be thrown.
>
> I'd go for an exception. But I don't know the code enough to understand
> how likely this will happen and how likely this is a programmer error or
> something else.

AFAICT it would be an implementation error

state is maintained in both the pull parser and the cursor

the cursor needs to understand whether it is within a part in a mime
message or not, since the input stream reads only within a part.
the pull parser also records this information.

would probably be cleaner to maintain this in one place. ideas welcomed.

> > 3. the API uses a string to represent the MIME boundary. i'm not sure
> > that this is right. AIUI (hopefully people will correct me if i'm
> > wrong) this can only be 8 bit ASCII characters. in general, passing a
> > string should mean worrying about encoding. realistically, the string
> > will just be stripped to its low order bytes.
> >
> > - robert
>
> Why 8 bit ASCII ? Shouldn't it be 7 bit ASCII? The first 7 bit of the
> US-ASCII should be present in every encoding, right?

sorry: forgot that 7-bit, 8-bit has special meaning in the email context

AIUI the boundary consists of ASCII each encoded as one 8-bit byte
with one clean bit. java strings (and chars) are UNICODE. this is
usually encoded as two 8-bit bytes (no clean bits), one 16-bit byte
(no clean bits) or variable (one, two or three) 8-bit bytes.

accepting a string might require a byte in the input to be decoded to
a char then encoded to a byte to be used to compare the boundary.

an alternative strategy would be to push enough intelligence into the
cursor for it to be able to work out MIME and header boundaries for
itself.

- robert

Re: [mime4j] Please Review Cursor API

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin ha scritto:
> http://svn.apache.org/repos/asf/james/mime4j/trunk/src/main/java/org/apache/james/mime4j/Cursor.java
> 
> contains a first cut at a cursor API
> 
> comments and improvements welcomed :-)
> 
> a few particular points:
> 
> 1. exception handling strategy - opted to throw IOExceptions almost
> everywhere. when required, custom subclasses will be created.

+1

> 2. there are currently some methods which seem like they may be
> stateful. for example, it's not certain how to interpret
> advanceToBoundary if boundary has not previously been set. the
> question is whether to specify reasonable behaviour (for example, when
> the boundary has not been set, advanceToBoundary should do nothing)
> or insist that exceptions be thrown.

I'd go for an exception. But I don't know the code enough to understand
how likely this will happen and how likely this is a programmer error or
something else.

> 3. the API uses a string to represent the MIME boundary. i'm not sure
> that this is right. AIUI (hopefully people will correct me if i'm
> wrong) this can only be 8 bit ASCII characters. in general, passing a
> string should mean worrying about encoding. realistically, the string
> will just be stripped to its low order bytes.
> 
> - robert

Why 8 bit ASCII ? Shouldn't it be 7 bit ASCII? The first 7 bit of the
US-ASCII should be present in every encoding, right?

RFC2046:
   As stated in the definition of the Content-Transfer-Encoding field
   [RFC 2045], no encoding other than "7bit", "8bit", or "binary" is
   permitted for entities of type "multipart".  The "multipart" boundary
   delimiters and header fields are always represented as 7bit US-ASCII
   in any case (though the header fields may encode non-US-ASCII header
   text as per RFC 2047) and data within the body parts can be encoded
   on a part-by-part basis, with Content-Transfer-Encoding fields for
   each appropriate body part.

Stefano

