Posted to server-dev@james.apache.org by Niklas Therning <ni...@trillian.se> on 2004/08/10 11:41:42 UTC

SAX-like MIME parser

I read a discussion 
(http://www.mail-archive.com/server-dev@james.apache.org/msg01985.html) 
on this list a while ago concerning a stream-based MIME parser which was 
going to be donated to ASF. Since then I haven't been able to find any 
more information about this. Is this not going to happen?

The reason why I ask is that we might have developed something similar. 
It's a MIME message parser similar to SAX using a callback object to 
report parsing events such as the start of a new entity, the start of a 
header, etc. On top of this we have another parser parsing messages into 
more of a DOM-like structure using temporary files for large attachments 
to save on memory.
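
To make the callback idea concrete, here is a minimal sketch of what a SAX-like, line-oriented parser could look like. The interface EntityHandler, its method names, and the class LineParserSketch are hypothetical and are not taken from the library described above. The sketch handles only a single non-multipart entity, works on a String for brevity, and assumes Java 11 or later for String.lines().

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical callback interface; the method names are illustrative only.
interface EntityHandler {
    void startEntity();
    void field(String rawField);   // one header field, continuation lines included
    void bodyLine(String line);    // one body line, without its line terminator
    void endEntity();
}

final class LineParserSketch {
    // Handles a single non-multipart entity: header fields up to the first
    // empty line, then the body, reported line by line.
    static void parse(String message, EntityHandler handler) {
        List<String> lines = message.lines().collect(Collectors.toList());
        handler.startEntity();
        StringBuilder field = null;
        boolean inBody = false;
        for (String line : lines) {
            if (inBody) {
                handler.bodyLine(line);
            } else if (line.isEmpty()) {
                // The first empty line ends the header.
                if (field != null) handler.field(field.toString());
                field = null;
                inBody = true;
            } else if (field != null && (line.charAt(0) == ' ' || line.charAt(0) == '\t')) {
                // Folded continuation line: keep it raw, as it was read.
                field.append("\r\n").append(line);
            } else {
                if (field != null) handler.field(field.toString());
                field = new StringBuilder(line);
            }
        }
        if (field != null) handler.field(field.toString()); // message without a body
        handler.endEntity();
    }
}
```

A DOM-like layer could then be written as one particular EntityHandler that collects the events into a tree.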

We would like to release this library as open source (Apache licensed of 
course) but if there are other similar projects due to be released 
shortly maybe we should try to join forces with them instead.

Our library has been tested on a large number of messages (~10000 
messages - a lot of them spam) and the Perl MIME::Tools parser has been 
used for comparison. In most cases (99%) the results of our parsers and 
the MIME::Tools parser are identical. When the results differ (which has 
only happened for illegal MIME-encoded spam messages) we think our 
parser does a better job.

Regards,
Niklas Therning


---------------------------------------------------------------------
To unsubscribe, e-mail: server-dev-unsubscribe@james.apache.org
For additional commands, e-mail: server-dev-help@james.apache.org


Re: SAX-like MIME parser

Posted by Joe Cheng <co...@joecheng.com>.
Sourceforge is not responding right now... when it comes back up I'll 
take a look at the code.

> Joe, looking at your code gave me a lot of ideas how our parser can be 
> improved! One thing I don't like about our parser is that it is line 
> oriented. It reads line by line and (at least for the body of an 
> entity) the callback is called for each line. I would like to 
> incorporate your MIMEBoundaryInputStream into our parser and change 
> the signature of the body, preamble and epilogue callback methods so 
> they get an InputStream to read the contents from instead. Since our 
> philosophy is to have a parser which gives the user exactly what is 
> read from the message, I don't think the InputStream should be wrapped 
> in InputStreams that do base64 or qp decoding. This could easily be 
> done at a higher level for those who require it.


Fair enough... you should be able to do this by a trivial change in 
MIMEParser.java.

> What does MIMEBoundaryInputStream do if the message stream is missing 
> the boundary it is looking for but still contains higher level 
> boundaries? Of course this violates the standards but I have seen it 
> occur (at least in spam messages). I think it should look not only for 
> the expected next boundary but should also detect any boundary of a 
> higher level (closer to the root of the mime entity tree) part and 
> stop. That's what we're doing and it gives a nice result for bad 
> messages.

Actually, this should already be handled correctly.  The higher level 
MIMEBoundaryInputStream will see its boundary.  The nested 
MIMEBoundaryInputStream will simply see a premature EOF, which it is 
designed to handle.  So are the Base64 and QP decoders.

The symmetry of the nested InputStream design makes a lot of these edge 
conditions disappear.  In fact, you could even embed a base64-encoded 
message/rfc822 that has multiple MIME parts (which is a horrible 
violation of the spec) and it would still work--you'd get callbacks on 
the nested message, even though the boundaries and everything are base64.
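
The premature-EOF behaviour described above can be illustrated with a small sketch. This is not the actual MIMEBoundaryInputStream; BoundaryStreamSketch is a hypothetical, simplified stand-in that buffers one line at a time and reports end of stream when it reaches its own delimiter line or when the underlying stream ends. As a simplification, it leaves the line break preceding the delimiter in the content.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Simplified, hypothetical stand-in for a boundary-limited input stream.
final class BoundaryStreamSketch extends InputStream {
    private final InputStream in;
    private final String delimiter;     // "--" + boundary
    private byte[] line = new byte[0];  // current line, terminator included
    private int pos = 0;
    private boolean eof = false;

    BoundaryStreamSketch(InputStream in, String boundary) {
        this.in = in;
        this.delimiter = "--" + boundary;
    }

    @Override
    public int read() throws IOException {
        if (eof) return -1;
        if (pos == line.length && !fill()) {
            eof = true;
            return -1;
        }
        return line[pos++] & 0xFF;
    }

    // Buffers the next line. Returns false at a delimiter line, or when the
    // underlying stream ends early, which is treated as a normal end.
    private boolean fill() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            buf.write(b);
            if (b == '\n') break;
        }
        line = buf.toByteArray();
        pos = 0;
        if (line.length == 0) return false;
        String text = new String(line, StandardCharsets.ISO_8859_1)
                .replaceAll("[ \\t\\r\\n]+$", "");
        return !(text.equals(delimiter) || text.equals(delimiter + "--"));
    }

    // Helper for the usage example: reads a stream to its end as a string.
    static String drain(InputStream s) {
        try {
            return new String(s.readAllBytes(), StandardCharsets.ISO_8859_1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With an inner stream wrapped around an outer one, a part that is missing its "--inner" delimiter still ends cleanly: the outer stream stops at "--outer", so the inner stream simply runs out of input and reports end of stream.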

> I'd like to share some ideas on how we're testing the parser. The perl 
> MIME::Tools package (used by SpamAssassin and lots of other projects) 
> contains a number of faulty or hard to parse test messages which I 
> have included in the sources. We have a perl script which uses the 
> MIME::Parser to parse those messages and produces XML files for each 
> one. We use these XML files in our tests to compare with the output of 
> our parser. We have also used the same testing method to test with a 
> large set of real world messages. Comparing the output with another 
> parser (which you should trust of course  ;-) ) is a great help 
> because it would be impossible to compare the results by hand when you 
> run the parser through a couple of thousand messages.

That is a great idea.  I tested the hardest messages I could find by 
hand (vs. Mozilla Thunderbird and Outlook Express) and for the rest, 
just ran them through the parser and sanity-checked the messages that 
the parser didn't like.  Your way is much better.

> I strongly think we should try to collaborate on this project somehow 
> as suggested by Noel. We should try to set up some common design goals 
> if possible and start from there. I don't think it would be a problem 
> for us to donate our code to ASF. By the way, is Joe's code going to 
> be a standalone project (in commons for example) or will it be 
> included in James? We would prefer to have a separate jar which would 
> make it much easier to use it in our application.

Did we ever come to a conclusion on this?  I think we were talking about 
starting with it in James, so we can start using it without waiting for 
it to gestate in Commons?

Now I'm all itchy to implement the header field parsers.  I'll start 
getting reacquainted with the spec tonight...



RE: SAX-like MIME parser

Posted by "Noel J. Bergman" <no...@devtech.com>.
Niklas Therning wrote:

> It's in the CVS now at sourceforge, project name mime4j.

I see http://cvs.sourceforge.net/viewcvs.py/mime4j/.  Looks like one of them
is Joe's and the other is yours.  I am preparing to import Joe's code into
Subversion.  If you would please submit a Software Grant
(http://www.apache.org/licenses/) for your code, and provide the code to us,
we can import it as well.  The other thing we need to do is fill out a
form[1] for the Incubator that basically documents that we have done due
diligence to make sure that the IP issues are taken care of.  Joe's code
pre-dates that form, but we do have a Software Grant from his company and a
CLA from him.

Seeing the great ideas and energy you guys are starting to generate, I'd
like to see it put to work.  :-)

As for how it is packaged, that is TBD, and we don't have to package it the
same for JAMES as you want it for your application.  I do agree that it
makes sense to keep it suitable for standalone packaging, and would like to
see unit tests included in the package tree.

One thing that is absent from Joe's code is the ability to edit messages.
When Joe mentioned that you have a DOM-like model included in your code, as
well as a SAX-like model, it sounded to me as if your approach might be
amenable to having edit support.  Yes?

	--- Noel

[1]
http://cvs.apache.org/viewcvs/incubator/site/projects/ip-clearance-template.cwiki




Re: SAX-like MIME parser

Posted by Niklas Therning <ni...@trillian.se>.
Noel J. Bergman wrote:

>> They have access to my code, but I have not seen theirs yet (they are
>> working on getting it loaded into SourceForge).
>
> Well, I will get yours into source control before end of weekend (if not
> tonight -- depends upon the weather).  They can also submit a Software Grant
> for it, and we could put that in source control, too.  What I could envision
> would be you and they collaborating here on the further development of a
> "best of" codebase.

It's in the CVS now at sourceforge, project name mime4j. I can't see it 
using viewcvs but I can check it out (I haven't tried checking out by 
anon cvs though). Maybe viewcvs caches things for a while.

Joe, looking at your code gave me a lot of ideas how our parser can be 
improved! One thing I don't like about our parser is that it is line 
oriented. It reads line by line and (at least for the body of an entity) 
the callback is called for each line. I would like to incorporate your 
MIMEBoundaryInputStream into our parser and change the signature of the 
body, preamble and epilogue callback methods so they get an InputStream 
to read the contents from instead. Since our philosophy is to have a 
parser which gives the user exactly what is read from the message, I 
don't think the InputStream should be wrapped in InputStreams that do 
base64 or qp decoding. This could easily be done at a higher level for 
those who require it.
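
As a rough illustration of decoding at a higher level, the sketch below assumes the parser hands the caller a raw InputStream together with the entity's Content-Transfer-Encoding value. The class DecodingLayerSketch and its methods are hypothetical. Base64.getMimeDecoder().wrap() is the standard JDK call; quoted-printable would need its own stream filter, which is not shown.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DecodingLayerSketch {
    // The caller opts into decoding; the parser itself stays raw.
    static InputStream decode(InputStream raw, String transferEncoding) {
        if ("base64".equalsIgnoreCase(transferEncoding)) {
            // The MIME decoder skips line breaks between encoded lines.
            return Base64.getMimeDecoder().wrap(raw);
        }
        // 7bit, 8bit and binary pass through unchanged.
        // quoted-printable is left undecoded in this sketch.
        return raw;
    }

    // Helper for the usage example: reads a stream to its end as ASCII text.
    static String readAscii(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A user who wants the exact bytes of the message skips decode() entirely and reads the raw stream.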

What does MIMEBoundaryInputStream do if the message stream is missing 
the boundary it is looking for but still contains higher level 
boundaries? Of course this violates the standards but I have seen it 
occur (at least in spam messages). I think it should look not only for 
the expected next boundary but should also detect any boundary of a 
higher level (closer to the root of the mime entity tree) part and stop. 
That's what we're doing and it gives a nice result for bad messages.

I'd like to share some ideas on how we're testing the parser. The perl 
MIME::Tools package (used by SpamAssassin and lots of other projects) 
contains a number of faulty or hard to parse test messages which I have 
included in the sources. We have a perl script which uses the 
MIME::Parser to parse those messages and produces XML files for each 
one. We use these XML files in our tests to compare with the output of 
our parser. We have also used the same testing method to test with a 
large set of real world messages. Comparing the output with another 
parser (which you should trust of course  ;-) ) is a great help because 
it would be impossible to compare the results by hand when you run the 
parser through a couple of thousand messages.
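
The comparison step can be sketched as follows. This is only an outline of the idea, not the actual test code: DifferentialTestSketch is a hypothetical name, the maps stand in for the message files and the XML files produced by the reference parser, and the parser under test is reduced to a function from a raw message to its serialized output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.function.Function;

final class DifferentialTestSketch {
    // messages:  message name -> raw message text
    // reference: message name -> output of the trusted reference parser
    // Returns the names of all messages whose output differs, sorted by name.
    static List<String> mismatches(Map<String, String> messages,
                                   Map<String, String> reference,
                                   Function<String, String> parserUnderTest) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, String> e : new TreeMap<>(messages).entrySet()) {
            String actual = parserUnderTest.apply(e.getValue());
            if (!Objects.equals(actual, reference.get(e.getKey()))) {
                failed.add(e.getKey());
            }
        }
        return failed;
    }
}
```

Only the messages in the returned list need to be inspected by hand.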

I strongly think we should try to collaborate on this project somehow as 
suggested by Noel. We should try to set up some common design goals if 
possible and start from there. I don't think it would be a problem for 
us to donate our code to ASF. By the way, is Joe's code going to be a 
standalone project (in commons for example) or will it be included in 
James? We would prefer to have a separate jar which would make it much 
easier to use it in our application.

/Niklas Therning




RE: SAX-like MIME parser

Posted by "Noel J. Bergman" <no...@devtech.com>.
> They have access to my code, but I have not seen theirs yet (they are
> working on getting it loaded into SourceForge).

Well, I will get yours into source control before end of weekend (if not
tonight -- depends upon the weather).  They can also submit a Software Grant
for it, and we could put that in source control, too.  What I could envision
would be you and they collaborating here on the further development of a
"best of" codebase.

	--- Noel




Re: SAX-like MIME parser

Posted by Joe Cheng <co...@joecheng.com>.
> I would be very pleased to see you and Joe Cheng collaborate.  Have you had
> a chance to compare notes and code, and see how much common ground exists
> within the two MIME parsers?

They have access to my code, but I have not seen theirs yet (they are 
working on getting it loaded into SourceForge).

Here's what I've been able to glean from an e-mail exchange with Niklas 
yesterday.

Their streaming parser is more low-level than mine: the way mine is 
currently written, you don't have the option of examining an attachment 
in its raw base64 state; instead you are given the decoded stream.  
IIUC, their streaming parser calls you back with each line of the pure, 
unadulterated RFC822 message; if a streaming base64/qp-decoding parser 
is needed, you would build it on top of the low-level parser.  They also 
provide callbacks for the preamble and epilogue, which I currently discard.

My parser could easily be modified to work the way theirs does, and I 
can only assume the opposite is true as well (or at least that the 
layered decoding streaming parser would be quite easy to write).  But it 
does reflect the difference between our two approaches.  I wrote a 
streaming parser to be directly used; if a somewhat different type of 
parser is needed, my thought was to reuse the underlying input stream 
filters, because they are natural building blocks and outside of them 
there is really very little code.  Niklas and his team wrote a low-level 
streaming parser, with the intention that user-level parsers would be 
built on top, of which the DOM parser is the first.

Both parsers should do a good job not crashing on real-world messages, 
even ones that are not quite legal MIME.

Their providing a DOM model is a nice bonus.  I don't have one, but when 
I considered it I wanted to build it such that almost no decoded data 
would need to be held in memory.  (It'd essentially be like faking 
continuations at appropriate times during the streaming run.)

Their DOM parser, which does the content transfer decoding, fully 
decodes all attachments when you run the parser.  Niklas says: "Although 
constructing a complete parse tree this parser tries to be memory 
efficient by using temporary files to store large attachments."  My 
parser doesn't charge you (much of) a penalty for attachments you don't 
actually do anything with; in other words, only the bytes you ask for 
are decoded.  And mine doesn't use any temporary files.

On the other hand, perhaps their DOM layer could be modified to take a 
body part selector that would let you decide which parts you want (e.g. 
only those whose content-type matches text/*).
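
One possible shape for such a selector, purely as a sketch: a predicate over the Content-Type value that the DOM layer would consult before decoding or storing a part. PartSelectorSketch and contentType() are hypothetical names, and neither parser is known to offer this today.

```java
import java.util.Locale;
import java.util.function.Predicate;

final class PartSelectorSketch {
    // Builds a selector from a pattern such as "text/*" or "image/png".
    static Predicate<String> contentType(String pattern) {
        String p = pattern.toLowerCase(Locale.ROOT);
        return value -> {
            // Drop parameters such as "; charset=us-ascii" before matching.
            int semi = value.indexOf(';');
            String mime = (semi < 0 ? value : value.substring(0, semi))
                    .trim().toLowerCase(Locale.ROOT);
            if (p.endsWith("/*")) {
                // "text/*" matches any subtype: compare against the "text/" prefix.
                return mime.startsWith(p.substring(0, p.length() - 1));
            }
            return mime.equals(p);
        };
    }
}
```

Parts rejected by the selector could be skipped without decoding them or writing temporary files.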

The most critical piece missing from my parser is that you don't get 
much help with parsing most structured headers (From, To, Received, 
Date, etc.).  Unfortunately, their parser currently is in the same state 
as mine.  We both have unfolding and full parsing of MIME-relevant 
fields, but that's all.  Niklas said they have plans to implement 
parsers for the other header fields, and I... well, I have the best of 
intentions. :)
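
For reference, the unfolding step mentioned above is small enough to sketch. This is an assumption about what it amounts to, following the usual RFC 2822 rule of removing any line break that is immediately followed by a space or tab. UnfoldSketch is a hypothetical name and is not code from either parser.

```java
final class UnfoldSketch {
    // Removes each CRLF (or bare LF) that is directly followed by a space
    // or tab. The whitespace itself is kept as part of the field value.
    static String unfold(String rawField) {
        return rawField.replaceAll("\\r?\\n(?=[ \\t])", "");
    }
}
```

Structured fields such as From, To, Received and Date would still need their own parsers on top of the unfolded value.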

I think I've said more than enough considering I haven't even seen their 
code yet!

-joe



RE: SAX-like MIME parser

Posted by "Noel J. Bergman" <no...@devtech.com>.
Niklas Therning wrote:
> I read a discussion
> (http://www.mail-archive.com/server-dev@james.apache.org/msg01985.html)

Yes, we have that code, and need to put it into the MAIN branch, now that we
have shipped version 2.2.  Actually, since we're going to end up migrating
into SVN, I might put it into Subversion to start.  We already have jSieve
there.

> The reason why I ask is that we might have developed something similar.
> It's a MIME message parser similar to SAX using a callback object to
> report parsing events such as the start of a new entity, the start of a
> header, etc. On top of this we have another parser parsing messages into
> more of a DOM-like structure using temporary files for large attachments
> to save on memory.

> We would like to release this library as open source (Apache licensed of
> course) but if there are other similar projects due to be released
> shortly maybe we should try to join forces with them instead.

I would be very pleased to see you and Joe Cheng collaborate.  Have you had
a chance to compare notes and code, and see how much common ground exists
within the two MIME parsers?

	--- Noel




Re: SAX-like MIME parser

Posted by Joe Cheng <co...@joecheng.com>.
Hi Niklas,

I was waiting for Noel et al to be ready for James 3.0 development.  At 
the time, everyone was starting to focus on getting v2.2.0 or whatever 
out the door.

As we've discussed off-list today, I don't have a problem with stepping 
aside if your library is superior and/or you and your team have more 
time and resources to dedicate to this (which is virtually certain).  And 
I also have no problem with you taking as much or as little of my code 
as you want.

-joe

