Posted to server-dev@james.apache.org by Robert Burrell Donkin <ro...@gmail.com> on 2007/10/30 22:29:13 UTC

[IMAP] MessageResult += Content

i've been reworking https://issues.apache.org/jira/browse/JAMES-808 to
factor out an interface for content which exposes the size and allows
the content to be written. there are quite a number of different bits
of content which would benefit from the size+write approach and so
IMHO an extra interface will help to keep the API consistent and
readable

hope to commit a version today. i'd be grateful if people would take a
look at the commit diffs and either patch any design improvements they
can see or reply to list

- robert

    /**
     * IMAP needs to know the size of the content before it starts to
     * write it out.
     * This interface allows direct writing whilst exposing total size.
     */
    public interface Content {
        /**
         * Writes content into the given buffer.
         * @param buffer <code>StringBuffer</code>, not null
         * @throws MessagingException
         */
        public void write(StringBuffer buffer) throws MessagingException;

        /**
         * Size (in octets) of the content.
         * @return number of octets to be written
         * @throws MessagingException
         */
        public long size() throws MessagingException;
    }
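
A minimal sketch of how the size+write contract might be used, not the committed JAMES code. The nested Content interface is a stand-in for the one above, with the checked MessagingException left out so the example compiles on its own; AsciiContent and asLiteral are hypothetical names. It assumes the content is already US-ASCII, so one char appended equals one octet.

```java
import java.nio.charset.StandardCharsets;

/** Sketch only: a size-aware content holder modelled on the interface above. */
final class ContentSketch {

    /** Stand-in for the posted interface (assumption: no checked exception). */
    interface Content {
        void write(StringBuffer buffer);
        long size();
    }

    /** Hypothetical implementation backed by text that is already US-ASCII. */
    static final class AsciiContent implements Content {
        private final byte[] octets;

        AsciiContent(String alreadyEncoded) {
            // Refuse anything outside US-ASCII, otherwise size() would no
            // longer match what write() appends.
            for (int i = 0; i < alreadyEncoded.length(); i++) {
                if (alreadyEncoded.charAt(i) > 0x7F) {
                    throw new IllegalArgumentException(
                            "non US-ASCII character at index " + i);
                }
            }
            this.octets = alreadyEncoded.getBytes(StandardCharsets.US_ASCII);
        }

        @Override
        public void write(StringBuffer buffer) {
            // One char per octet: the number of chars appended equals size().
            for (byte b : octets) {
                buffer.append((char) b);
            }
        }

        @Override
        public long size() {
            return octets.length;
        }
    }

    /** Formats an IMAP-style literal: the octet count is announced before the data. */
    static String asLiteral(Content content) {
        StringBuffer out = new StringBuffer();
        out.append('{').append(content.size()).append("}\r\n");
        content.write(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Content body = new AsciiContent("Hello\r\n");
        System.out.print(asLiteral(body));
    }
}
```

The point of the sketch is only that size() can be answered before write() is called, which is what a protocol that announces the length up front needs.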

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 5, 2007 12:03 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
> > On Nov 5, 2007 9:01 AM, Stefano Bagnara <ap...@bago.org> wrote:
> >> Is it only metadata+headers or they also ask something about bodies
> >> (excluding length) during the opening?
> >
> > the meta-data includes information about the structure body content
> > including access to the MIME meta-data and encoding. also other
> > assorted data such as number of lines.
>
> "structure body content" means that during opening they want to know how
> many parts compose each message, how they are nested, what kind of
> encoding, disposition and other part headers they have? Or do they need to
> know only the first level or anything simpler?

full nesting, including line counts and octet lengths as they will be output

> As you said, there are many ways for IMAP clients to do the same thing,
> but I would probably target Thunderbird and Outlook as the most used
> clients: is this a correct assumption?

i don't use either of them

thunderbird is peculiar and didn't work well with JAMES last time i checked

outlook is worse since it's not a standard IMAP client

evolution used to crash constantly with JAMES but all versions work
ok(ish) with my local fork and the later versions don't crash even
without my local fixes

basically, i've come to the conclusion that the only way to have a
practical IMAP is a full, standard implementation that is reasonably
quick for all operations

> In this case, do we already know
> the exact "query" Thunderbird and Outlook do during opening? (or maybe
> they changed from version to version, too?)

i haven't monitored versions of thunderbird. different versions of
evolution use different queries.

IMO the only way to do IMAP is to do everything quickly and correctly

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin wrote:
> On Nov 5, 2007 9:01 AM, Stefano Bagnara <ap...@bago.org> wrote:
>> Is it only metadata+headers or they also ask something about bodies
>> (excluding length) during the opening?
> 
> the meta-data includes information about the structure body content
> including access to the MIME meta-data and encoding. also other
> assorted data such as number of lines.

"structure body content" means that during opening they want to know how
many parts compose each message, how they are nested, what kind of
encoding, disposition and other part headers they have? Or do they need to
know only the first level or anything simpler?

As you said, there are many ways for IMAP clients to do the same thing,
but I would probably target Thunderbird and Outlook as the most used
clients: is this a correct assumption? In this case, do we already know
the exact "query" Thunderbird and Outlook do during opening? (or maybe
they changed from version to version, too?)

Stefano

Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 5, 2007 9:01 AM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
> > in order to create a usable IMAP server, reading meta-data must be
> > very fast and reading normal (non-MIME) mail fast. to create an IMAP
> > server which will run on a machine of moderate power, meta-data
> > reading and normal message reading must not consume a lot of memory.
>
> Do you know what are the "queries" made by most common IMAP clients?

yes but it depends on the client

IMAP suffers from design by committee: there are typically several
different mechanisms to achieve any one goal

> If we know what are the metadata/data required at opening we can try to
> optimize them.

yep (that's what this is all about :-)

but full message speed is an effective barrier to adoption since it
imposes an effective practical upper limit on the size of mailboxes
and on the number of concurrent clients that can be supported by the
server

> Is it only metadata+headers or they also ask something about bodies
> (excluding length) during the opening?

the meta-data includes information about the structure body content
including access to the MIME meta-data and encoding. also other
assorted data such as number of lines.

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin wrote:
> in order to create a usable IMAP server, reading meta-data must be
> very fast and reading normal (non-MIME) mail fast. to create an IMAP
> server which will run on a machine of moderate power, meta-data
> reading and normal message reading must not consume a lot of memory.

Do you know what are the "queries" made by most common IMAP clients?
If we know what are the metadata/data required at opening we can try to
optimize them.

Is it only metadata+headers or they also ask something about bodies
(excluding length) during the opening?

Stefano


Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 4, 2007 11:30 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
> >> I've not even investigated this option, but before thinking what the
> >> real problems could be I want to be sure I'm understanding what you propose!
> >
> > messages are typically read more often than they are written. unless
> > the API is able to offer some guarantees about the output, it is
> > forced to assume the worst.
> >
> > in practice, this implies re-parsing and re-encoding the complete
> > message each time any information needs to be read. the code which
> > took this approach is too slow and uses too much memory to be
> > reasonably usable even on a quick machine. (several minutes to open a
> > new mailbox on my AMD64 with 1G RAM allocated to JAMES.)
>
> In SMTP and POP3 this is not a real issue. I don't know IMAP too much.
> Is it a common case that a message content is read over and over again?
> I thought that most things were cached on the client side and read very
> few times from the server. Is this a wrong assumption?

in theory, that's correct. in practice, though, it's not quite so easy.

typically, IMAP clients write message body content very rarely - it's
mainly a reading protocol. IMAP clients try to read body content only
once. IMAP clients write meta-data often and read meta-data very
frequently. this meta-data refers to the correctly encoded MIME form
of the message.

IMAP clients typically read all text messages they haven't seen in a
mailbox (MIME messages are typically only read on display). opening a
mailbox for the first time (or one which has not been opened in a long
while) means reading a lot of mail. if reading each mail takes (on
average) 6 seconds (say) then a moderately sized mailbox with 100
messages will take 10 minutes which is far too long to be usable.
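
The arithmetic behind that estimate, as a tiny sketch. The class and method names are hypothetical, and it assumes the messages are read one after another with no overlap.

```java
/** Sketch only: back-of-envelope time to open a mailbox for the first time. */
final class MailboxOpenEstimate {

    /** Total seconds if every message is read once, sequentially. */
    static long totalSeconds(int messages, int secondsPerMessage) {
        return (long) messages * secondsPerMessage;
    }

    public static void main(String[] args) {
        long seconds = totalSeconds(100, 6);
        // 100 messages * 6 s = 600 s, i.e. 10 minutes
        System.out.println(seconds + " s = " + (seconds / 60) + " min");
    }
}
```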

in order to create a usable IMAP server, reading meta-data must be
very fast and reading normal (non-MIME) mail fast. to create an IMAP
server which will run on a machine of moderate power, meta-data
reading and normal message reading must not consume a lot of memory.

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin wrote:
>> I've not even investigated this option, but before thinking what the
>> real problems could be I want to be sure I'm understanding what you propose!
>
> messages are typically read more often than they are written. unless
> the API is able to offer some guarantees about the output, it is
> forced to assume the worst.
> 
> in practice, this implies re-parsing and re-encoding the complete
> message each time any information needs to be read. the code which
> took this approach is too slow and uses too much memory to be
> reasonably usable even on a quick machine. (several minutes to open a
> new mailbox on my AMD64 with 1G RAM allocated to JAMES.)

In SMTP and POP3 this is not a real issue. I don't know IMAP too much.
Is it a common case that a message content is read over and over again?
I thought that most things were cached on the client side and read very
few times from the server. Is this a wrong assumption?

Stefano

> the MailboxAPI layer is in a position to perform optimisations. it may
> elect to re-encode or cache 8bit mime parts. it may decide to
> re-encode on the way in or on the way out.
> 
> - robert



Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 4, 2007 11:21 PM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
> > On Nov 2, 2007 7:22 PM, Robert Burrell Donkin
> > <ro...@gmail.com> wrote:
> >> On Nov 2, 2007 12:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
> >>> Robert Burrell Donkin wrote:
> >>>>> I'm not sure I understand the size in octets. You write a StringBuffer,
> >>>>> so it is a Unicode string, how can you calculate the real octets if
> >>>>> you don't know the charset/encoding that will be used when the buffer
> >>>>> will be written out?
> >>>> the content must be prior encoded into US-ASCII. probably should be javadoc'd.
> >>> At least SMTP supports 8bitmime feature and binary encoding. Do you mean
> >>> that we'll have to re-encode those messages in order to store them using
> >>> the MailboxManager API ?
> >> this is an output API: the input API is a different matter
> >>
> >> IMHO the MailboxAPI should be liberal in what it accepts but precise
> >> in what it outputs
> >
> > there is a fundamental conflict between the needs of a system that
> > just wants to store a MimeMessage quickly and then retrieve it a small
> > number of times with absolute fidelity at some future time, and the
> > needs of protocols that need to read that data quickly many times.
>
> Right. Something we should care about is also RFC compliance.

+1

> To keep
> SMTP compliance we should make sure that a message is not normalized or
> "fixed" before it is relayed (as an example).

i've added this example to
http://wiki.apache.org/james/BackendMailboxAPI but it would be great
if other SMTP requirements could be added

> IIRC SMTP tells us that we can reject an invalid message but we can't fix
> it and relay it. So we can normalize/fix it/be liberal only if we keep
> the result for ourselves, but we need a way to "relay" the original message.

ok

> Maybe we should simply avoid using this mailbox stuff also for spooling
> and keep the spooling very "stream/buffer" oriented while
> parsing/normalizing/fixing when storing to the mailboxes.

i'm not sure that the MailboxAPI is right to fix or normalise at all.
all that the backend can do is to convert all isolated CRs and LFs to
CRLFs. my reading of RFC822 is that the responsibility to fix line
endings is at the transport boundary. this makes sense to me: only at
that boundary can the incoming encoding be correctly understood.

ATM the MailboxAPI corrects line endings. IMHO this is wrong and
should be changed.
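
For illustration, the conversion described above (isolated CRs and LFs become CRLF) could look roughly like this. It is a sketch with hypothetical names, not the MailboxAPI code, and it works on a String for brevity where a real implementation would more likely work on octets.

```java
/** Sketch only: rewrites bare CR and bare LF as CRLF. */
final class LineEndings {

    /** Existing CRLF pairs are left alone; every other CR or LF becomes CRLF. */
    static String toCrlf(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                out.append("\r\n");
                // Skip the LF of an existing CRLF so it is not doubled.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
            } else if (c == '\n') {
                out.append("\r\n");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = toCrlf("a\nb\rc\r\nd");
        System.out.println(fixed.length());
    }
}
```

Note that the result no longer says which line endings the original used, and every character has to be examined to produce it.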

> This for sure
> is a performance leak as while spooling we often have mailets looking up
> message content/structure/headers. Maybe the "liberal" parsed version
> during spooling can simply be cached and only stored when the message is
> delivered to the mailbox. Or maybe we should keep the stream/buffers in
> the spool and then lazily create a structured representation of the
> message as soon as a parsing is needed and only use the parsed version
> once the message is altered (let's still remember that adding the
> Received header on top of the message should be done without altering
> anything else in the message by a relaying server).

either of these options sounds acceptable to me
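
A rough sketch of the second option (keep the stream/buffer, build the structured view only when something asks for it). All names here are hypothetical and the header parsing is deliberately naive: no folded headers, and the last value wins for repeated names.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch only: keeps the raw message untouched and parses headers on first use. */
final class LazyMessage {
    private final byte[] raw;             // original octets, never modified
    private Map<String, String> headers;  // null until a header is first requested
    private int parseCount;               // how many times parsing actually ran

    LazyMessage(byte[] raw) {
        this.raw = raw.clone();
    }

    /** The message exactly as received, e.g. for relaying. */
    byte[] original() {
        return raw.clone();
    }

    /** Case-insensitive header lookup; parses once and caches the result. */
    synchronized String header(String name) {
        if (headers == null) {
            headers = parseHeaders();
        }
        return headers.get(name.toLowerCase(Locale.ROOT));
    }

    synchronized int parseCount() {
        return parseCount;
    }

    private Map<String, String> parseHeaders() {
        parseCount++;
        Map<String, String> parsed = new HashMap<String, String>();
        // ISO-8859-1 maps every octet to exactly one char, so scanning loses nothing.
        String text = new String(raw, StandardCharsets.ISO_8859_1);
        int end = text.indexOf("\r\n\r\n");
        String block = end < 0 ? text : text.substring(0, end);
        for (String line : block.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
                parsed.put(key, line.substring(colon + 1).trim());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        byte[] raw = "Subject: hi\r\nFrom: a@example.org\r\n\r\nbody"
                .getBytes(StandardCharsets.US_ASCII);
        LazyMessage message = new LazyMessage(raw);
        System.out.println(message.header("subject"));
    }
}
```

A message that is only relayed never pays for parsing, and original() still returns the octets as they arrived.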

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin wrote:
> On Nov 2, 2007 7:22 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> On Nov 2, 2007 12:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
>>> Robert Burrell Donkin wrote:
>>>>> I'm not sure I understand the size in octets. You write a StringBuffer,
>>>>> so it is a Unicode string, how can you calculate the real octets if
>>>>> you don't know the charset/encoding that will be used when the buffer
>>>>> will be written out?
>>>> the content must be prior encoded into US-ASCII. probably should be javadoc'd.
>>> At least SMTP supports 8bitmime feature and binary encoding. Do you mean
>>> that we'll have to re-encode those messages in order to store them using
>>> the MailboxManager API ?
>> this is an output API: the input API is a different matter
>>
>> IMHO the MailboxAPI should be liberal in what it accepts but precise
>> in what it outputs
>
> there is a fundamental conflict between the needs of a system that
> just wants to store a MimeMessage quickly and then retrieve it a small
> number of times with absolute fidelity at some future time, and the
> needs of protocols that need to read that data quickly many times.

Right. Something we should care about is also RFC compliance. To keep
SMTP compliance we should make sure that a message is not normalized or
"fixed" before it is relayed (as an example).
IIRC SMTP tells us that we can reject an invalid message but we can't fix
it and relay it. So we can normalize/fix it/be liberal only if we keep
the result for ourselves, but we need a way to "relay" the original message.

Maybe we should simply avoid using this mailbox stuff also for spooling
and keep the spooling very "stream/buffer" oriented while
parsing/normalizing/fixing when storing to the mailboxes. This for sure
is a performance leak as while spooling we often have mailets looking up
message content/structure/headers. Maybe the "liberal" parsed version
during spooling can simply be cached and only stored when the message is
delivered to the mailbox. Or maybe we should keep the stream/buffers in
the spool and then lazily create a structured representation of the
message as soon as a parsing is needed and only use the parsed version
once the message is altered (let's still remember that adding the
Received header on top of the message should be done without altering
anything else in the message by a relaying server).

Stefano

> for example, examining every byte and then normalising line endings is
> expensive if it's done for every read. it should be done before the
> message is stored. however, simply normalising means losing
> information.
> 
> - robert

Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 2, 2007 7:22 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> On Nov 2, 2007 12:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
> > Robert Burrell Donkin wrote:
> > >> I'm not sure I understand the size in octets. You write a StringBuffer,
> > >> so it is a Unicode string, how can you calculate the real octets if
> > >> you don't know the charset/encoding that will be used when the buffer
> > >> will be written out?
> > >
> > > the content must be prior encoded into US-ASCII. probably should be javadoc'd.
> >
> > At least SMTP supports 8bitmime feature and binary encoding. Do you mean
> > that we'll have to re-encode those messages in order to store them using
> > the MailboxManager API ?
>
> this is an output API: the input API is a different matter
>
> IMHO the MailboxAPI should be liberal in what it accepts but precise
> in what it outputs

there is a fundamental conflict between the needs of a system that
just wants to store a MimeMessage quickly and then retrieve it a small
number of times with absolute fidelity at some future time, and the
needs of protocols that need to read that data quickly many times.

for example, examining every byte and then normalising line endings is
expensive if it's done for every read. it should be done before the
message is stored. however, simply normalising means losing
information.

- robert

Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Nov 2, 2007 12:34 AM, Stefano Bagnara <ap...@bago.org> wrote:
> Robert Burrell Donkin wrote:
> >> I'm not sure I understand the size in octets. You write a StringBuffer,
> >> so it is a Unicode string, how can you calculate the real octets if
> >> you don't know the charset/encoding that will be used when the buffer
> >> will be written out?
> >
> > the content must be prior encoded into US-ASCII. probably should be javadoc'd.
>
> At least SMTP supports 8bitmime feature and binary encoding. Do you mean
> that we'll have to re-encode those messages in order to store them using
> the MailboxManager API ?

this is an output API: the input API is a different matter

IMHO the MailboxAPI should be liberal in what it accepts but precise
in what it outputs

> I've not even investigated this option, but before thinking what the
> real problems could be I want to be sure I'm understanding what you propose!

messages are typically read more often than they are written. unless
the API is able to offer some guarantees about the output, it is
forced to assume the worst.

in practice, this implies re-parsing and re-encoding the complete
message each time any information needs to be read. the code which
took this approach is too slow and uses too much memory to be
reasonably usable even on a quick machine. (several minutes to open a
new mailbox on my AMD64 with 1G RAM allocated to JAMES.)

the MailboxAPI layer is in a position to perform optimisations. it may
elect to re-encode or cache 8bit mime parts. it may decide to
re-encode on the way in or on the way out.

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

Robert Burrell Donkin wrote:
>> I'm not sure I understand the size in octets. You write a StringBuffer,
>> so it is a Unicode string, how can you calculate the real octets if
>> you don't know the charset/encoding that will be used when the buffer
>> will be written out?
>
> the content must be prior encoded into US-ASCII. probably should be javadoc'd.

At least SMTP supports 8bitmime feature and binary encoding. Do you mean
that we'll have to re-encode those messages in order to store them using
the MailboxManager API ?

I've not even investigated this option, but before thinking what the
real problems could be I want to be sure I'm understanding what you propose!

Stefano


Re: [IMAP] MessageResult += Content

Posted by Robert Burrell Donkin <ro...@gmail.com>.

On Oct 31, 2007 9:44 AM, Stefano Bagnara <ap...@bago.org> wrote:
> +1 for the interface.
>
> Maybe "writeTo" is better than "write" (the first thought when I read
> write(StringBuffer) is that the method writes the content of StringBuffer
> somewhere and not vice versa).

+1

> I'm not sure I understand the size in octets. You write a StringBuffer,
> so it is a Unicode string, how can you calculate the real octets if
> you don't know the charset/encoding that will be used when the buffer
> will be written out?

the content must be prior encoded into US-ASCII. probably should be javadoc'd.

IMO use of StringBuffer is a poor design choice (but some work would
be required to change it) but has no negative practical effects.
should probably deprecate.
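
To make the octets-versus-chars point concrete, here is a small sketch. The class and method names are hypothetical; only standard java.nio.charset calls are used. For US-ASCII text the char count and the octet count agree, and for anything else they depend on the charset chosen at write time.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch only: why a char count is an octet count only for US-ASCII text. */
final class OctetCount {

    /** Number of octets the text occupies once encoded with the given charset. */
    static int octets(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** True if every char is in the 7-bit US-ASCII range. */
    static boolean isUsAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String accented = "caff\u00e8"; // 5 chars, the last one is not US-ASCII
        System.out.println(octets(accented, StandardCharsets.UTF_8));      // 6
        System.out.println(octets(accented, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(isUsAscii("plain text"));                       // true
    }
}
```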

> Do we need to know the charset and the transfer encoding from the header
> of this content to be able to correctly evaluate the content or the
> content has already been correctly "decoded" ?

the content needs to be appropriately prior encoded. if the content
has been decoded by the backend then it needs to be appropriately
re-encoded.

- robert

Re: [IMAP] MessageResult += Content

Posted by Stefano Bagnara <ap...@bago.org>.

+1 for the interface.

Maybe "writeTo" is better than "write" (the first thought when I read
write(StringBuffer) is that the method writes the content of StringBuffer
somewhere and not vice versa).

I'm not sure I understand the size in octets. You write a StringBuffer,
so it is a Unicode string, how can you calculate the real octets if
you don't know the charset/encoding that will be used when the buffer
will be written out?

Do we need to know the charset and the transfer encoding from the header
of this content to be able to correctly evaluate the content or the
content has already been correctly "decoded" ?

Stefano

Robert Burrell Donkin wrote:
> i've been reworking https://issues.apache.org/jira/browse/JAMES-808 to
> factor out an interface for content which exposes the size and allows
> the content to be written. there are quite a number of different bits
> of content which would benefit from the size+write approach and so
> IMHO an extra interface will help to keep the API consistent and
> readable
> 
> hope to commit a version today. i'd be grateful if people would take a
> look at the commit diffs and either patch any design improvements they
> can see or reply to list
> 
> - robert
> 
>     /**
>      * IMAP needs to know the size of the content before it starts to
>      * write it out.
>      * This interface allows direct writing whilst exposing total size.
>      */
>     public interface Content {
>         /**
>          * Writes content into the given buffer.
>          * @param buffer <code>StringBuffer</code>, not null
>          * @throws MessagingException
>          */
>         public void write(StringBuffer buffer) throws MessagingException;
> 
>         /**
>          * Size (in octets) of the content.
>          * @return number of octets to be written
>          * @throws MessagingException
>          */
>         public long size() throws MessagingException;
>     }


