Posted to server-dev@james.apache.org by Jukka Zitting <ju...@gmail.com> on 2007/05/31 16:08:45 UTC

Mime4j and buffering

Hi,

I've been looking at MIME4J-5 and I have a few ideas on how to speed
up parsing. However, I'm not sure about how the underlying mime stream
should be treated. I would use a lookahead buffer but that would leave
the underlying stream in an undefined state for example when parsing
is stopped with MimeStreamParser.stop(). Is this OK?

BR,

Jukka Zitting

---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org

Re: Mime4j and buffering

Posted by Niklas Therning <ni...@trillian.se>.

Jukka Zitting wrote:
> Hi,
>
> I've been looking at MIME4J-5 and I have a few ideas on how to speed
> up parsing. However, I'm not sure about how the underlying mime stream
> should be treated. I would use a lookahead buffer but that would leave
> the underlying stream in an undefined state for example when parsing
> is stopped with MimeStreamParser.stop(). Is this OK?
>
I think it would be ok. Actually, I'm not sure that the current code
leaves the stream in a defined state when stop() is called.

-- 
Niklas Therning
www.spamdrain.net



Re: Mime4j and buffering

Posted by robert burrell donkin <ro...@gmail.com>.

On 7/11/07, robert burrell donkin <ro...@gmail.com> wrote:
> On 7/11/07, Jochen Wiedmann <jo...@gmail.com> wrote:
> > robert burrell donkin-2 wrote:
> > >
> > > On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> > >
> > >> I'm planning to use a buffer (even a mapped one if using nio) to load
> > >> larger chunks of the message being parsed. The parser can then "look
> > >> ahead" in the buffer to find the next multipart boundary without
> > >> having to check each byte individually.
> > >
> > > any progress?
> > >
> >
> > It took me about 4 hours to create the patch for MIME4J-19. (Pull parser
> > API)
>
> cool :-)
>
> (i've heard that cxf also has an email pull parser - be interesting to
> compare designs)
>
> > Given my experiences with commons-fileupload, I believe it would take
> > another 6 hours or so to rewrite MIME4J-19 a second time in order to use a
> > single, buffered InputStream, which would even be able to provide
> > information like line and column number and byte offset.
>
> sounds interesting :-)

i've had a play around this afternoon and think this could be the best way to go

> i've been thinking about nio and parsers for bytebuffers recently.
> (the current JAMES IMAP implementation stores the bodies in byte
> arrays.)

it should be reasonably easy to add basic support for parsing data in
a byte buffer but it'll need to wait until the pull parser is
committed

- robert


Re: Mime4j and buffering

Posted by robert burrell donkin <ro...@gmail.com>.

On 7/11/07, Jochen Wiedmann <jo...@gmail.com> wrote:
> robert burrell donkin-2 wrote:
> >
> > On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> >
> >> I'm planning to use a buffer (even a mapped one if using nio) to load
> >> larger chunks of the message being parsed. The parser can then "look
> >> ahead" in the buffer to find the next multipart boundary without
> >> having to check each byte individually.
> >
> > any progress?
> >
>
> It took me about 4 hours to create the patch for MIME4J-19. (Pull parser
> API)

cool :-)

(i've heard that cxf also has an email pull parser - be interesting to
compare designs)

> Given my experiences with commons-fileupload, I believe it would take
> another 6 hours or so to rewrite MIME4J-19 a second time in order to use a
> single, buffered InputStream, which would even be able to provide
> information like line and column number and byte offset.

sounds interesting :-)

i've been thinking about nio and parsers for bytebuffers recently.
(the current JAMES IMAP implementation stores the bodies in byte
arrays.)

- robert


Re: Mime4j and buffering

Posted by Serge Knystautas <sk...@gmail.com>.

On 7/11/07, Jochen Wiedmann <jo...@gmail.com> wrote:
> It took me about 4 hours to create the patch for MIME4J-19. (Pull parser
> API) Given my experiences with commons-fileupload, I believe it would take
> another 6 hours or so to rewrite MIME4J-19 a second time in order to use a
> single, buffered InputStream, which would even be able to provide
> information like line and column number and byte offset.

This sounds very promising.  I had to work on parsing a huge XML doc
that would have been impossible with DOM, and SAX is a pain to use.  I
found the StAX parser [1] that is available in Java 6, and it's a
cursor-based/pull-style XML parser.  I found the pattern to be very
effective, and it seems like it could address similar issues that
affect MIME parsing.
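
For anyone who has not used StAX, the pull style looks roughly like the
sketch below. It uses only the standard javax.xml.stream API; the class
name StaxPullExample and the method elementNames are made up for this
illustration and have nothing to do with Mime4j.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxPullExample {

    /** Collects the local names of all start elements, in document order. */
    public static List<String> elementNames(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> names = new ArrayList<>();
        try {
            // The caller asks for the next event instead of receiving callbacks.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } finally {
            reader.close();
        }
        return names;
    }
}
```

The caller drives the loop and can simply stop calling next() at any
event, which is the property that would carry over to a pull-style MIME
parser.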

I'm not sure if/how they handle writing/modifying XML documents, but
might be interesting to see how they address that for Mime4j.  It
looks like this is one of the rare JCP groups that actually did a good
job collecting (rather than inventing) requirements and solutions.

[1] http://www.javabeat.net/javabeat/java6/articles/2007/06/java-6-0-new-features-part-2/2

-- 
Serge Knystautas
Lokitech >> software . strategy . design >> http://www.lokitech.com
p. 301.656.5501
e. sergek@lokitech.com


Re: Mime4j and buffering

Posted by Jochen Wiedmann <jo...@gmail.com>.

robert burrell donkin-2 wrote:
> 
> On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> 
>> I'm planning to use a buffer (even a mapped one if using nio) to load
>> larger chunks of the message being parsed. The parser can then "look
>> ahead" in the buffer to find the next multipart boundary without
>> having to check each byte individually.
> 
> any progress?
> 

It took me about 4 hours to create the patch for MIME4J-19. (Pull parser
API) Given my experiences with commons-fileupload, I believe it would take
another 6 hours or so to rewrite MIME4J-19 a second time in order to use a
single, buffered InputStream, which would even be able to provide
information like line and column number and byte offset.

Jochen



Re: Mime4j and buffering

Posted by robert burrell donkin <ro...@gmail.com>.

On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On 5/31/07, Norman Maurer <no...@apache.org> wrote:
> > Can you please elaborate on what a lookahead buffer is?
>
> I'm planning to use a buffer (even a mapped one if using nio) to load
> larger chunks of the message being parsed. The parser can then "look
> ahead" in the buffer to find the next multipart boundary without
> having to check each byte individually.

any progress?

the biggest issue ATM with JAMES IMAP is that MimeMessage really isn't
a suitable intermediary representation. it's neither efficient nor
accurate. i've taken a look at the current mime4j API. i like it but
try as i might, a streaming push API just isn't efficient for IMAP.

- robert


Re: Mime4j and buffering

Posted by Norman Maurer <no...@apache.org>.

On Thursday, 31.05.2007 at 17:23 +0100, robert burrell donkin wrote:
> On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> > Hi,
> >
> > On 5/31/07, Norman Maurer <no...@apache.org> wrote:
> > > Can you please elaborate on what a lookahead buffer is?
> >
> > I'm planning to use a buffer (even a mapped one if using nio) to load
> > larger chunks of the message being parsed. The parser can then "look
> > ahead" in the buffer to find the next multipart boundary without
> > having to check each byte individually.
> 
> sounds great :-)
> 
> i'm thinking of switching IMAP to use mime4j (to avoid issues with
> MimeMessage) and being able to use nio fits in very well with my plans
> 
> - robert

+1

bye
Norman



Re: Mime4j and buffering

Posted by robert burrell donkin <ro...@gmail.com>.

On 5/31/07, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On 5/31/07, Norman Maurer <no...@apache.org> wrote:
> > Can you please elaborate on what a lookahead buffer is?
>
> I'm planning to use a buffer (even a mapped one if using nio) to load
> larger chunks of the message being parsed. The parser can then "look
> ahead" in the buffer to find the next multipart boundary without
> having to check each byte individually.

sounds great :-)

i'm thinking of switching IMAP to use mime4j (to avoid issues with
MimeMessage) and being able to use nio fits in very well with my plans

- robert


Re: Mime4j and buffering

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 5/31/07, Norman Maurer <no...@apache.org> wrote:
> Can you please elaborate on what a lookahead buffer is?

I'm planning to use a buffer (even a mapped one if using nio) to load
larger chunks of the message being parsed. The parser can then "look
ahead" in the buffer to find the next multipart boundary without
having to check each byte individually.
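
To make the idea concrete, here is a minimal sketch in Java. It is only
an illustration under my own assumptions, not Mime4j code: the class
name LookaheadScanner, the method indexOf and the bufferSize parameter
are all hypothetical. It reads the stream in chunks and keeps the last
pattern.length - 1 bytes between reads so that a boundary straddling
two chunks is still found.

```java
import java.io.IOException;
import java.io.InputStream;

public class LookaheadScanner {

    /**
     * Returns the stream offset of the first occurrence of pattern, or -1.
     * The stream is read in chunks, so on return it is positioned somewhere
     * past the match, not exactly at it.
     */
    public static long indexOf(InputStream in, byte[] pattern, int bufferSize)
            throws IOException {
        if (pattern.length == 0 || bufferSize < pattern.length) {
            throw new IllegalArgumentException("buffer must hold a non-empty pattern");
        }
        byte[] buf = new byte[bufferSize];
        int filled = 0;   // number of valid bytes in buf
        long base = 0;    // stream offset of buf[0]
        while (true) {
            int n = in.read(buf, filled, buf.length - filled);
            if (n > 0) {
                filled += n;
            }
            // Look ahead through everything currently buffered.
            for (int i = 0; i + pattern.length <= filled; i++) {
                if (matchesAt(buf, i, pattern)) {
                    return base + i;
                }
            }
            if (n < 0) {
                return -1; // end of stream, nothing found
            }
            // Keep a tail that could be the start of a straddling match.
            int keep = Math.min(filled, pattern.length - 1);
            System.arraycopy(buf, filled - keep, buf, 0, keep);
            base += filled - keep;
            filled = keep;
        }
    }

    private static boolean matchesAt(byte[] buf, int off, byte[] pattern) {
        for (int j = 0; j < pattern.length; j++) {
            if (buf[off + j] != pattern[j]) {
                return false;
            }
        }
        return true;
    }
}
```

Note that the sketch has exactly the property I was asking about: when
it returns, the underlying stream has been consumed beyond the point
the caller has logically reached.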

BR,

Jukka Zitting


Re: Mime4j and buffering

Posted by Norman Maurer <no...@apache.org>.

Can you please elaborate on what a lookahead buffer is?

bye
Norman

On Thursday, 31.05.2007 at 17:08 +0300, Jukka Zitting wrote:
> Hi,
> 
> I've been looking at MIME4J-5 and I have a few ideas on how to speed
> up parsing. However, I'm not sure about how the underlying mime stream
> should be treated. I would use a lookahead buffer but that would leave
> the underlying stream in an undefined state for example when parsing
> is stopped with MimeStreamParser.stop(). Is this OK?
> 
> BR,
> 
> Jukka Zitting



Re: Mime4j and buffering

Posted by Stefano Bagnara <ap...@bago.org>.

Jukka Zitting wrote:
> Hi,
> 
> By the way, do we have a good set of test messages somewhere that I
> could use when testing my Mime4j modifications?
> 
> BR,
> 
> Jukka Zitting

I attached the messages I removed (for copyright reasons) here:
http://issues.apache.org/jira/browse/MIME4J-11

I think you would do better to create custom messages specific to the
area you change: e.g. if you need a big message to check memory usage
and elapsed time, you can create a big message. You could also create a
worst-case message where the boundary text with one character removed is
used as a fill pattern for the body content.
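
A worst-case message of that kind could be generated along these lines.
This is only a sketch under my own assumptions: the class name
WorstCaseMessage, the method build and the repeats parameter are
hypothetical, and the headers are the bare minimum for a single-part
multipart/mixed message.

```java
public class WorstCaseMessage {

    /**
     * Builds a multipart message whose single body part is filled with
     * lines that look like the boundary delimiter minus its last character.
     */
    public static String build(String boundary, int repeats) {
        if (boundary.isEmpty()) {
            throw new IllegalArgumentException("boundary must not be empty");
        }
        // "--" + boundary with the final character removed: a near miss.
        String nearMiss = "--" + boundary.substring(0, boundary.length() - 1);
        StringBuilder sb = new StringBuilder();
        sb.append("MIME-Version: 1.0\r\n");
        sb.append("Content-Type: multipart/mixed; boundary=\"")
          .append(boundary).append("\"\r\n\r\n");
        sb.append("--").append(boundary).append("\r\n");      // opening delimiter
        sb.append("Content-Type: text/plain\r\n\r\n");
        for (int i = 0; i < repeats; i++) {
            sb.append(nearMiss).append("\r\n");               // fill pattern
        }
        sb.append("--").append(boundary).append("--\r\n");    // closing delimiter
        return sb.toString();
    }
}
```

Every fill line forces the matcher to compare almost the whole boundary
before failing, so a large repeats value should show how the boundary
search behaves on unfavourable input.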

You create the message, place it in the test folder, and run the tests.
This will create "expected" files and fail. Then you manually inspect the
created files for correctness and rename them to be the "official"
test files. On subsequent runs they will be part of the tests.

Stefano



Re: Mime4j and buffering

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

By the way, do we have a good set of test messages somewhere that I
could use when testing my Mime4j modifications?

BR,

Jukka Zitting


Re: Mime4j and buffering

Posted by robert burrell donkin <ro...@gmail.com>.

On 6/7/07, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On 6/7/07, Andrew C. Oliver <ac...@buni.org> wrote:
> > although you can't use it (due to Apache's anti-LGPL dogma)
> > http://blog.buni.org/blog/mbarker/Meldware/2007/06/04/Panto-0-4-release-Still-really-fast
> >
> > I suggest looking at the technique used by Buni's panto.
>
> Thanks for the tip! I actually considered using a similar approach but
> with quick testing it seems like the benefit of skipping bytes in the
> Boyer-Moore algorithm is not too big for typical MIME boundaries that
> are something like 20-40 bytes long. I guess the cache lines of
> typical processors are already that big, so fetching just a single
> byte within the boundary range is roughly equivalent to fetching all
> the bytes especially if you have slow RAM.
>
> I'm currently experimenting with an algorithm that does a sequential
> scan of the data, but instead of doing it one byte at a time the
> algorithm tries to match 4 or 8 byte sequences depending on how long
> the boundary string is.

perhaps multiple algorithms would be useful. it is often possible to
know or estimate the size of the message.

- robert


Re: Mime4j and buffering

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On 6/7/07, Andrew C. Oliver <ac...@buni.org> wrote:
> although you can't use it (due to Apache's anti-LGPL dogma)
> http://blog.buni.org/blog/mbarker/Meldware/2007/06/04/Panto-0-4-release-Still-really-fast
>
> I suggest looking at the technique used by Buni's panto.

Thanks for the tip! I actually considered using a similar approach but
with quick testing it seems like the benefit of skipping bytes in the
Boyer-Moore algorithm is not too big for typical MIME boundaries that
are something like 20-40 bytes long. I guess the cache lines of
typical processors are already that big, so fetching just a single
byte within the boundary range is roughly equivalent to fetching all
the bytes especially if you have slow RAM.
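
For reference, the byte-skipping idea can be sketched with the Horspool
simplification of Boyer-Moore, which uses only the bad-character shift
table. This is not full Boyer-Moore and it is not Panto's or Mime4j's
code; the class name HorspoolSearch is hypothetical.

```java
import java.util.Arrays;

public class HorspoolSearch {

    /** Returns the index of the first occurrence of pattern in data, or -1. */
    public static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }
        // shift[b]: how far the window may move when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }
        int pos = 0;
        while (pos + m <= data.length) {
            // Compare right to left within the current window.
            int j = m - 1;
            while (j >= 0 && data[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Skip ahead based on the byte under the window's last position.
            pos += shift[data[pos + m - 1] & 0xFF];
        }
        return -1;
    }
}
```

With a 30-byte boundary the window can move up to 30 bytes per
mismatch, but each step still touches memory inside that 30-byte range,
which is the cache-line effect described above.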

I'm currently experimenting with an algorithm that does a sequential
scan of the data, but instead of doing it one byte at a time the
algorithm tries to match 4 or 8 byte sequences depending on how long
the boundary string is.
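
One possible reading of that idea, purely as a sketch and not the code
being experimented with: pack the first 4 bytes of the boundary into an
int, slide a 4-byte window over the data, and only compare the rest of
the boundary when the two ints are equal. The class name WordScan and
the fixed 4-byte width are assumptions made for this example.

```java
public class WordScan {

    /** Returns the index of the first occurrence of pattern in data, or -1. */
    public static int indexOf(byte[] data, byte[] pattern) {
        int m = pattern.length;
        if (m < 4) {
            throw new IllegalArgumentException("pattern must be at least 4 bytes");
        }
        if (data.length < m) {
            return -1;
        }
        int key = pack(pattern, 0);   // first 4 bytes of the boundary
        int window = pack(data, 0);   // 4 data bytes starting at pos
        for (int pos = 0; ; pos++) {
            // One int comparison filters out most positions.
            if (window == key && restMatches(data, pos, pattern)) {
                return pos;
            }
            if (pos + m >= data.length) {
                return -1; // no room left for a full match
            }
            // Slide the window one byte to the right.
            window = (window << 8) | (data[pos + 4] & 0xFF);
        }
    }

    private static int pack(byte[] b, int off) {
        return ((b[off] & 0xFF) << 24) | ((b[off + 1] & 0xFF) << 16)
                | ((b[off + 2] & 0xFF) << 8) | (b[off + 3] & 0xFF);
    }

    private static boolean restMatches(byte[] data, int pos, byte[] pattern) {
        for (int i = 4; i < pattern.length; i++) {
            if (data[pos + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An 8-byte variant would use a long in the same way for longer
boundaries.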

BR,

Jukka Zitting


Re: Mime4j and buffering

Posted by Stefano Bagnara <ap...@bago.org>.

Andrew C. Oliver wrote:
> although you can't use it (due to Apache's anti-LGPL dogma)
> http://blog.buni.org/blog/mbarker/Meldware/2007/06/04/Panto-0-4-release-Still-really-fast
> 
> 
> I suggest looking at the technique used by Buni's panto.

Hi Andrew,

I hadn't thought of it before, but Boyer–Moore string search is really
appropriate for boundary matching!

Thank you for the hint,
Stefano

PS: if you used MPL or a similar license I would have evaluated part of
your meldware suite for a project of mine, but LGPL is something no one
can really use (and feel safe) in a java project, imho.



Re: Mime4j and buffering

Posted by Stefano Bagnara <ap...@bago.org>.

Stefano Bagnara wrote:
> Andrew C. Oliver wrote:
>> although you can't use it (due to Apache's anti-LGPL dogma)
>> http://blog.buni.org/blog/mbarker/Meldware/2007/06/04/Panto-0-4-release-Still-really-fast
>>
>>
>> I suggest looking at the technique used by Buni's panto.
> 
> Hi Andrew,
> 
> I hadn't thought of it before, but Boyer–Moore string search is really
> appropriate for boundary matching!

If anyone is interested, we already have an ASF implementation of the
Boyer-Moore algorithm:
org.apache.xerces.impl.xpath.regex.BMPattern
http://svn.apache.org/viewvc/xerces/java/trunk/src/org/apache/xerces/impl/xpath/regex/BMPattern.java?revision=446721&view=markup

Stefano



Re: Mime4j and buffering

Posted by "Andrew C. Oliver" <ac...@buni.org>.

although you can't use it (due to Apache's anti-LGPL dogma)
http://blog.buni.org/blog/mbarker/Meldware/2007/06/04/Panto-0-4-release-Still-really-fast

I suggest looking at the technique used by Buni's panto.

"
Parsing a 2MB message 20 times:

Apache Mime4J: 4049ms
Buni Panto: 233ms
"

(This is a somewhat absurd test, but that is because I whined when he
checked a much larger message into CVS.)  Another thing not shown here is
the memory requirements.  Note that Meldware doesn't keep the entire
mail in memory the way JavaMail requires.

Jukka Zitting wrote:
> Hi,
> 
> I've been looking at MIME4J-5 and I have a few ideas on how to speed
> up parsing. However, I'm not sure about how the underlying mime stream
> should be treated. I would use a lookahead buffer but that would leave
> the underlying stream in an undefined state for example when parsing
> is stopped with MimeStreamParser.stop(). Is this OK?
> 
> BR,
> 
> Jukka Zitting


-- 
Buni Meldware Communication Suite
http://buni.org
Multi-platform and extensible Email,
Calendaring (including freebusy),
Rich Webmail, Web-calendaring, ease
of installation/administration.
