Posted to server-dev@james.apache.org by Niklas Therning <ni...@trillian.se> on 2004/08/10 11:41:42 UTC
SAX-like MIME parser
I read a discussion
(http://www.mail-archive.com/server-dev@james.apache.org/msg01985.html)
on this list a while ago concerning a stream-based MIME parser which was
going to be donated to ASF. Since then I haven't been able to find any
more information about this. Is this not going to happen?
The reason why I ask is that we might have developed something similar.
It's a MIME message parser similar to SAX using a callback object to
report parsing events such as the start of a new entity, the start of a
header, etc. On top of this we have another parser parsing messages into
more of a DOM-like structure using temporary files for large attachments
to save on memory.
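A minimal sketch of the callback style described above, for illustration only. The names MimeEventHandler, TinyEventParser and their methods are hypothetical and are not taken from the library under discussion. The sketch handles a single, non-multipart entity and uses simplified header unfolding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; names are illustrative only. */
interface MimeEventHandler {
    void startEntity();
    void header(String name, String value);
    void bodyLine(String line);
    void endEntity();
}

/** Minimal event parser for one non-multipart entity. */
class TinyEventParser {
    static void parse(BufferedReader in, MimeEventHandler handler) throws IOException {
        handler.startEntity();
        StringBuilder field = null; // header field currently being unfolded
        String line;
        // Header section: ends at the first empty line (or end of input).
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            boolean continuation = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (continuation && field != null) {
                // Simplified unfolding: join with a single space.
                field.append(' ').append(line.trim());
            } else {
                emit(field, handler);
                field = new StringBuilder(line);
            }
        }
        emit(field, handler);
        // Body: every remaining line is reported as read.
        while ((line = in.readLine()) != null) {
            handler.bodyLine(line);
        }
        handler.endEntity();
    }

    private static void emit(StringBuilder field, MimeEventHandler handler) {
        if (field == null) return;
        int colon = field.indexOf(":");
        if (colon < 0) return; // malformed field; skipped in this sketch
        handler.header(field.substring(0, colon).trim(),
                       field.substring(colon + 1).trim());
    }

    /** Demonstration helper: records each event as a string. */
    static List<String> events(String message) {
        List<String> out = new ArrayList<>();
        try {
            parse(new BufferedReader(new StringReader(message)), new MimeEventHandler() {
                public void startEntity() { out.add("start"); }
                public void header(String n, String v) { out.add("header " + n + "=" + v); }
                public void bodyLine(String l) { out.add("body " + l); }
                public void endEntity() { out.add("end"); }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

A DOM-like layer could be built by writing a handler that collects these events into a tree instead of a list.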
We would like to release this library as open source (Apache licensed of
course) but if there are other similar projects due to be released
shortly maybe we should try to join forces with them instead.
Our library has been tested on a large number of messages (~10000
messages - a lot of them spam) and the Perl MIME::Tools parser has been
used for comparison. In most cases (99%) the results of our parser and
the MIME::Tools parser are identical. When the results differ (which has
only happened for illegal MIME encoded spam messages) we think our
parser does a better job.
Regards,
Niklas Therning
---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org
Re: SAX-like MIME parser
Posted by Joe Cheng <co...@joecheng.com>.
Sourceforge is not responding right now... when it comes back up I'll
take a look at the code.
> Joe, looking at your code gave me a lot of ideas for how our parser can
> be improved! One thing I don't like about our parser is that it is line
> oriented. It reads line by line and (at least for the body of an
> entity) the callback is called for each line. I would like to
> incorporate your MIMEBoundaryInputStream into our parser and change
> the signature of the body, preamble and epilogue callback methods so
> they get an InputStream to read the contents from instead. Since our
> philosophy is to have a parser which gives the user exactly what is
> read from the message, I don't think the InputStream should be wrapped
> in InputStreams that do base64 or qp decoding. This could easily be
> done at a higher level for those who require it.
Fair enough... you should be able to do this by a trivial change in
MIMEParser.java.
> What does MIMEBoundaryInputStream do if the message stream is missing
> the boundary it is looking for but still contains higher level
> boundaries? Of course this violates the standards but I have seen it
> occur (at least in spam messages). I think it should look not only for
> the expected next boundary but should also detect any boundary of a
> higher level (closer to the root of the mime entity tree) part and
> stop. That's what we're doing and it gives a nice result for bad
> messages.
Actually, this should already be handled correctly. The higher level
MIMEBoundaryInputStream will see its boundary. The nested
MIMEBoundaryInputStream will simply see a premature EOF, which it is
designed to handle. So are the Base64 and QP decoders.
The symmetry of the nested InputStream design makes a lot of these edge
conditions disappear. In fact, you could even embed a base64-encoded
message/rfc822 that has multiple MIME parts (which is a horrible
violation of the spec) and it would still work--you'd get callbacks on
the nested message, even though the boundaries and everything are base64.
> I'd like to share some ideas on how we're testing the parser. The Perl
> MIME::Tools package (used by SpamAssassin and lots of other projects)
> contains a number of faulty or hard to parse test messages which I
> have included in the sources. We have a Perl script which uses the
> MIME::Parser to parse those messages and produces XML files for each
> one. We use these XML files in our tests to compare with the output of
> our parser. We have also used the same testing method to test with a
> large set of real world messages. Comparing the output with another
> parser (which you should trust of course ;-) ) is a great help
> because it would be impossible to compare the results by hand when you
> run the parser through a couple of thousand messages.
That is a great idea. I tested the hardest messages I could find by
hand (vs. Mozilla Thunderbird and Outlook Express) and for the rest,
just ran them through the parser and sanity-checked the messages that
the parser didn't like. Your way is much better.
> I strongly think we should try to collaborate on this project somehow
> as suggested by Noel. We should try to set up some common design goals
> if possible and start from there. I don't think it would be a problem
> for us to donate our code to ASF. By the way, is Joe's code going to
> be a standalone project (in commons for example) or will it be
> included in James? We would prefer to have a separate jar which would
> make it much easier to use it in our application.
Did we ever come to a conclusion on this? I think we were talking about
starting with it in James, so we can start using it without waiting for
it to gestate in Commons?
Now I'm all itchy to implement the field header parsers. I'll start
getting reacquainted with the spec tonight...
RE: SAX-like MIME parser
Posted by "Noel J. Bergman" <no...@devtech.com>.
Niklas Therning wrote:
> It's in the CVS now at sourceforge, project name mime4j.
I see http://cvs.sourceforge.net/viewcvs.py/mime4j/. Looks like one of them
is Joe's and the other is yours. I am preparing to import Joe's code into
Subversion. If you would please submit a Software Grant
(http://www.apache.org/licenses/) for your code, and provide the code to us,
we can import it as well. The other thing we need to do is fill out a
form[1] for the Incubator that basically documents that we have done due
diligence to make sure that the IP issues are taken care of. Joe's code
pre-dates that form, but we do have a Software Grant from his company and a
CLA from him.
Seeing the great ideas and energy you guys are starting to generate, I'd
like to see it put to work. :-)
As for how it is packaged, that is TBD, and we don't have to package it the
same for JAMES as you want it for your application. I do agree that it
makes sense to keep it suitable for standalone packaging, and would like to
see unit tests included in the package tree.
One thing that is absent from Joe's code is the ability to edit messages.
When Joe mentioned that you have a DOM-like model included in your code, as
well as a SAX-like model, it sounded to me as if your approach might be
amenable to having edit support. Yes?
--- Noel
[1] http://cvs.apache.org/viewcvs/incubator/site/projects/ip-clearance-template.cwiki
Re: SAX-like MIME parser
Posted by Niklas Therning <ni...@trillian.se>.
Noel J. Bergman wrote:
>>They have access to my code, but I have not seen theirs yet (they are
>>working on getting it loaded into SourceForge).
>>
>>
>
>Well, I will get yours into source control before end of weekend (if not
>tonight -- depends upon the weather). They can also submit a Software Grant
>for it, and we could put that in source control, too. What I could envision
>would be you and they collaborating here on the further development of a
>"best of" codebase.
>
>
It's in the CVS now at sourceforge, project name mime4j. I can't see it
using viewcvs but I can check it out (I haven't tried checking out by
anon cvs though). Maybe viewcvs caches things for a while.
Joe, looking at your code gave me a lot of ideas for how our parser can be
improved! One thing I don't like about our parser is that it is line
oriented. It reads line by line and (at least for the body of an entity)
the callback is called for each line. I would like to incorporate your
MIMEBoundaryInputStream into our parser and change the signature of the
body, preamble and epilogue callback methods so they get an InputStream
to read the contents from instead. Since our philosophy is to have a
parser which gives the user exactly what is read from the message, I
don't think the InputStream should be wrapped in InputStreams that do
base64 or qp decoding. This could easily be done at a higher level for
those who require it.
What does MIMEBoundaryInputStream do if the message stream is missing
the boundary it is looking for but still contains higher level
boundaries? Of course this violates the standards but I have seen it
occur (at least in spam messages). I think it should look not only for
the expected next boundary but should also detect any boundary of a
higher level (closer to the root of the mime entity tree) part and stop.
That's what we're doing and it gives a nice result for bad messages.
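A line-oriented sketch of the boundary check described above, for illustration only. BoundaryScan, readPart and matchBoundary are hypothetical names, not part of either parser. The sketch assumes the caller keeps the boundaries of all enclosing multiparts in a list ordered from the root to the innermost part.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

class BoundaryScan {
    /** Result of reading one body part. */
    static final class Result {
        final String body;   // raw lines of the part, joined with "\n"
        final int stoppedAt; // index of the boundary that ended the part, -1 at end of input
        Result(String body, int stoppedAt) {
            this.body = body;
            this.stoppedAt = stoppedAt;
        }
    }

    /**
     * Reads lines until a delimiter for ANY boundary in the list is seen.
     * boundaries.get(0) belongs to the root multipart, the last entry to
     * the innermost one, which is the boundary normally expected next.
     */
    static Result readPart(BufferedReader in, List<String> boundaries) {
        StringBuilder body = new StringBuilder();
        boolean first = true;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                int hit = matchBoundary(line, boundaries);
                if (hit >= 0) {
                    return new Result(body.toString(), hit);
                }
                if (!first) body.append('\n');
                body.append(line);
                first = false;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Premature end of input: no boundary was found at all.
        return new Result(body.toString(), -1);
    }

    /** Returns the index of the matching boundary, innermost first, or -1. */
    static int matchBoundary(String line, List<String> boundaries) {
        // Delimiter lines may carry trailing whitespace.
        String trimmed = line.replaceAll("[ \\t]+$", "");
        for (int i = boundaries.size() - 1; i >= 0; i--) {
            String delimiter = "--" + boundaries.get(i);
            if (trimmed.equals(delimiter) || trimmed.equals(delimiter + "--")) {
                return i;
            }
        }
        return -1;
    }
}
```

If stoppedAt is smaller than the index of the innermost boundary, the inner multipart was never closed and the caller can unwind to the matching ancestor.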
I'd like to share some ideas on how we're testing the parser. The Perl
MIME::Tools package (used by SpamAssassin and lots of other projects)
contains a number of faulty or hard to parse test messages which I have
included in the sources. We have a Perl script which uses the
MIME::Parser to parse those messages and produces XML files for each
one. We use these XML files in our tests to compare with the output of
our parser. We have also used the same testing method to test with a
large set of real world messages. Comparing the output with another
parser (which you should trust of course ;-) ) is a great help because
it would be impossible to compare the results by hand when you run the
parser through a couple of thousand messages.
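The comparison step could be sketched roughly as follows. This is an illustration only: ReferenceComparison is a hypothetical name, and the sketch assumes both the reference XML (from the Perl script) and the output of the parser under test have already been loaded into maps keyed by message name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReferenceComparison {
    /** Normalizes line endings and trailing whitespace so trivial differences are ignored. */
    static String normalize(String xml) {
        return xml.replace("\r\n", "\n")
                  .replaceAll("[ \\t]+\\n", "\n")
                  .trim();
    }

    /**
     * Returns the names of all messages whose output is missing or differs
     * from the reference, in sorted order.
     */
    static List<String> mismatches(Map<String, String> expected, Map<String, String> actual) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(expected).entrySet()) {
            String got = actual.get(e.getKey());
            if (got == null || !normalize(e.getValue()).equals(normalize(got))) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

Only the messages returned by mismatches need to be inspected by hand, which keeps a run over thousands of messages manageable.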
I strongly think we should try to collaborate on this project somehow as
suggested by Noel. We should try to set up some common design goals if
possible and start from there. I don't think it would be a problem for
us to donate our code to ASF. By the way, is Joe's code going to be a
standalone project (in commons for example) or will it be included in
James? We would prefer to have a separate jar which would make it much
easier to use it in our application.
/Niklas Therning
RE: SAX-like MIME parser
Posted by "Noel J. Bergman" <no...@devtech.com>.
> They have access to my code, but I have not seen theirs yet (they are
> working on getting it loaded into SourceForge).
Well, I will get yours into source control before end of weekend (if not
tonight -- depends upon the weather). They can also submit a Software Grant
for it, and we could put that in source control, too. What I could envision
would be you and they collaborating here on the further development of a
"best of" codebase.
--- Noel
Re: SAX-like MIME parser
Posted by Joe Cheng <co...@joecheng.com>.
>I would be very pleased to see you and Joe Cheng collaborate. Have you had
>a chance to compare notes and code, and see how much common ground exists
>within the two MIME parsers?
>
>
They have access to my code, but I have not seen theirs yet (they are
working on getting it loaded into SourceForge).
Here's what I've been able to glean from an e-mail exchange with Niklas
yesterday.
Their streaming parser is more low-level than mine: the way mine is
currently written, you don't have the option of examining an attachment
in its raw base64 state; instead you are given the decoded stream.
IIUC, their streaming parser calls you back with each line of the pure,
unadulterated RFC822 message; if a streaming base64/qp-decoding parser
is needed, you would build it on top of the low-level parser. They also
provide callbacks for the preamble and epilogue, which I currently discard.
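To illustrate the layering described above, here is a minimal sketch of a decoding layer placed on top of a raw body stream. It uses java.util.Base64 from the modern JDK (Java 8 and later), not the decoder classes of either parser. TransferDecoding and its methods are hypothetical names, and quoted-printable is deliberately left out.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Base64;
import java.util.Locale;

class TransferDecoding {
    /**
     * Wraps the raw body stream according to the Content-Transfer-Encoding
     * value. Unknown or identity encodings return the raw stream unchanged.
     */
    static InputStream decode(InputStream rawBody, String transferEncoding) {
        String enc = transferEncoding == null
                ? ""
                : transferEncoding.trim().toLowerCase(Locale.ROOT);
        if (enc.equals("base64")) {
            // The MIME decoder skips line breaks and other non-alphabet characters.
            return Base64.getMimeDecoder().wrap(rawBody);
        }
        // quoted-printable would need its own filter stream; omitted in this sketch.
        return rawBody;
    }

    /** Reads a stream to the end; bytes are decoded only as they are requested. */
    static byte[] readAll(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this split, a caller that wants the raw base64 text simply reads the raw stream and never calls decode.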
My parser could easily be modified to work the way theirs does, and I
can only assume the opposite is true as well (or at least that the
layered decoding streaming parser would be quite easy to write). But it
does reflect the difference between our two approaches. I wrote a
streaming parser to be directly used; if a somewhat different type of
parser is needed, my thought was to reuse the underlying input stream
filters, because they are natural building blocks and outside of them
there is really very little code. Niklas and his team wrote a low-level
streaming parser, with the intention that user-level parsers would be
built on top, of which the DOM parser is the first.
Both parsers should do a good job not crashing on real-world messages,
even ones that are not quite legal MIME.
Their providing a DOM model is a nice bonus. I don't have one, but when
I considered it I wanted to build it such that almost no decoded data
would need to be held in memory. (It'd essentially be like faking
continuations at appropriate times during the streaming run.)
Their DOM parser, which does the content transfer decoding, fully
decodes all attachments when you run the parser. Niklas says: "Although
constructing a complete parse tree this parser tries to be memory
efficient by using temporary files to store large attachments." My
parser doesn't charge you (much of) a penalty for attachments you don't
actually do anything with; in other words, only the bytes you ask for
are decoded. And mine doesn't use any temporary files.
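The temporary-file approach mentioned in the quote above could look roughly like this. It is a sketch under assumptions, not their implementation: BodyStore and the threshold parameter are hypothetical, and the sketch uses java.nio.file from the modern JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Holds a body in memory when small, in a temporary file when large. */
class BodyStore {
    private final byte[] memory; // non-null when the body is held in memory
    private final Path file;     // non-null when the body was spilled to disk

    private BodyStore(byte[] memory, Path file) {
        this.memory = memory;
        this.file = file;
    }

    /**
     * Copies the stream, switching to a temporary file once more than
     * threshold bytes have been read. Memory use is bounded by roughly
     * threshold plus one read buffer.
     */
    static BodyStore store(InputStream in, int threshold) {
        try {
            ByteArrayOutputStream head = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                head.write(buf, 0, n);
                if (head.size() > threshold) {
                    Path tmp = Files.createTempFile("mime-body", ".tmp");
                    tmp.toFile().deleteOnExit();
                    try (OutputStream out = Files.newOutputStream(tmp)) {
                        head.writeTo(out); // flush what was buffered so far
                        while ((n = in.read(buf)) != -1) {
                            out.write(buf, 0, n);
                        }
                    }
                    return new BodyStore(null, tmp);
                }
            }
            return new BodyStore(head.toByteArray(), null);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean isOnDisk() {
        return file != null;
    }

    /** Opens a fresh stream over the stored body. */
    InputStream openStream() {
        try {
            return file != null ? Files.newInputStream(file) : new ByteArrayInputStream(memory);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Loads the whole body; intended for small bodies and tests. */
    byte[] toByteArray() {
        try {
            return file != null ? Files.readAllBytes(file) : memory.clone();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off matches the one discussed here: the whole body is copied once up front, whether or not anyone reads it later.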
On the other hand, perhaps their DOM layer could be modified to take a
body part selector that would let you decide which parts you want (e.g.
only those whose content-type matches text/*).
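One possible shape for such a selector, sketched as a plain predicate over the Content-Type value. PartSelectors and contentType are hypothetical names; nothing like this is confirmed to exist in either codebase.

```java
import java.util.Locale;
import java.util.function.Predicate;

class PartSelectors {
    /**
     * Builds a selector from a pattern such as "text/*" or "image/png".
     * Matching is case-insensitive and ignores parameters like charset.
     */
    static Predicate<String> contentType(String pattern) {
        String p = pattern.trim().toLowerCase(Locale.ROOT);
        return value -> {
            if (value == null) {
                return false;
            }
            // Drop parameters such as "; charset=us-ascii".
            int semi = value.indexOf(';');
            String type = (semi >= 0 ? value.substring(0, semi) : value)
                    .trim()
                    .toLowerCase(Locale.ROOT);
            if (p.endsWith("/*")) {
                // "text/*" matches any subtype: compare against the prefix "text/".
                return type.startsWith(p.substring(0, p.length() - 1));
            }
            return type.equals(p);
        };
    }
}
```

A DOM builder could consult the predicate when a part starts and skip decoding and storage for parts that do not match.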
The most critical piece missing from my parser is that you don't get
much help with parsing most structured headers (From, To, Received,
Date, etc.). Unfortunately, their parser currently is in the same state
as mine. We both have unfolding and full parsing of MIME-relevant
fields, but that's all. Niklas said they have plans to implement
parsers for the other header fields, and I... well, I have the best of
intentions. :)
I think I've said more than enough considering I haven't even seen their
code yet!
-joe
RE: SAX-like MIME parser
Posted by "Noel J. Bergman" <no...@devtech.com>.
Niklas Therning wrote:
> I read a discussion
> (http://www.mail-archive.com/server-dev@james.apache.org/msg01985.html)
Yes, we have that code, and need to put it into the MAIN branch, now that we
have shipped version 2.2. Actually, since we're going to end up migrating
into SVN, I might put it into Subversion to start. We already have jSieve
there.
> The reason why I ask is that we might have developed something similar.
> It's a MIME message parser similar to SAX using a callback object to
> report parsing events such as the start of a new entity, the start of a
> header, etc. On top of this we have another parser parsing messages into
> more of a DOM-like structure using temporary files for large attachments
> to save on memory.
> We would like to release this library as open source (Apache licensed of
> course) but if there are other similar projects due to be released
> shortly maybe we should try to join forces with them instead.
I would be very pleased to see you and Joe Cheng collaborate. Have you had
a chance to compare notes and code, and see how much common ground exists
within the two MIME parsers?
--- Noel
Re: SAX-like MIME parser
Posted by Joe Cheng <co...@joecheng.com>.
Hi Niklas,
I was waiting for Noel et al to be ready for James 3.0 development. At
the time, everyone was starting to focus on getting v2.2.0 or whatever
out the door.
As we've discussed off-list today, I don't have a problem with stepping
aside if your library is superior and/or you and your team have more
time and resources to dedicate to this (which is virtually certain). And
I also have no problem with you taking as much or as little of my code
as you want.
-joe