You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mime4j-dev@james.apache.org by Robert Burrell Donkin <ro...@gmail.com> on 2009/02/01 22:34:26 UTC

Re: Duplicate parsing of some header fields

On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
<ma...@gmail.com> wrote:
> When Mime4j is used to build a DOM some header fields are parsed more
> than once..

...i suspected that this might be the case...

> This is what happens:
>  * AbstractEntity.parseField parses the raw string into a name-value pair
>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
> valid fields
>  * DefaultBodyDescriptor parses header fields such as Content-Type
>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>  * eventually MimeStreamParser.parse retrieves the raw field from
> AbstractEntity and notifies a ContentHandler
>  * the ContentHandler (again) has to parse the raw string into name and value
>  * the ContentHandler (again) has to parse certain structural fields
> already parsed by a body descriptor

IIRC the descriptor stuff was added to satisfy requirements for SAX
and pull parsing downstream (in particular IMAP which uses a lot of
MIME meta-data)

> In case of building a DOM the latter two items are done by
> MessageBuilder by calling Field.parse().
>
> There are several issues here:
>  * the ContentHandler has to do a lot of work that has already been
> done before by AbstractEntity.

well, probably done before (dependent of parser settings)

>  * Field.parse() uses a JavaCC-generated parser whereas the body
> descriptors have their own parsing code.

yes - the body descriptors support a wider variety of fields than the
Field stuff

>  * we have unnecessary duplicate code and the two parsers are very
> likely inconsistent with each other.

yes - i trust the code in AbstractEntity more highly then that in Field

>  * parsing issues have to be addressed twice.

to support the wider range of fields in the DOM, then yes

> To resolve these problems I would like to propose a drastic change to the API:
>  * AbstractEntity should use Field.parse to parse the raw field
>  * The body descriptor can extract information from the Field
> instances without need of their own parsing code
>  * A ContentHandler would be notified of a Field instead of a string
>  * MessageBuilder could simply store the field and would not have to
> parse it again.
>
> Please note that with the recent changes to the API all concrete Field
> classes parse the field value lazily so there should be no significant
> performance impact.

i have no essential objections to the strategy though i think maybe
some discussion on tactics might be worthwhile

the pull parsing stuff only stores a limited number of headers. for
any header which isn't of interest to the user, the field is only
parsed into header and value. this is about the minimum work required
to discover the header name and so whether it's of interest.

the typical use case for not being interested in headers is
performance. so, though creating a field is a small hit, it would be
in keeping with the spirit of the API to add something like
isFieldSkippable (but with a better name) as well as modifying
addField to accept a field. then only when the descriptor is
interested in the result would the field need to be created. the SAX
handler would then take every field in full which is fine since it's a
higher level API.

if this is the approach adopted then there are some improvements which
could be made to the structure of the descriptor classes. i should be
able to find time to help out with them.

- robert

Re: Duplicate parsing of some header fields

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On 2/3/09, Markus Wiederkehr <ma...@gmail.com> wrote:
> On Tue, Feb 3, 2009 at 9:27 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> On 2/3/09, Markus Wiederkehr <ma...@gmail.com> wrote:
>>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
>>> <ro...@gmail.com> wrote:
>>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>>>> <ma...@gmail.com> wrote:
>>>>> When Mime4j is used to build a DOM some header fields are parsed more
>>>>> than once..
>>>>
>>>> ...i suspected that this might be the case...
>>>>
>>>>> This is what happens:
>>>>>  * AbstractEntity.parseField parses the raw string into a name-value
>>>>> pair
>>>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>>>> valid fields
>>>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>>>  * MaximalBodyDescriptor parses additional header fields such as
>>>>> Mime-Version
>>>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>>>> AbstractEntity and notifies a ContentHandler
>>>>>  * the ContentHandler (again) has to parse the raw string into name and
>>>>> value
>>>>>  * the ContentHandler (again) has to parse certain structural fields
>>>>> already parsed by a body descriptor
>>>>
>>>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>>>> and pull parsing downstream (in particular IMAP which uses a lot of
>>>> MIME meta-data)
>>>>
>>>>> In case of building a DOM the latter two items are done by
>>>>> MessageBuilder by calling Field.parse().
>>>>>
>>>>> There are several issues here:
>>>>>  * the ContentHandler has to do a lot of work that has already been
>>>>> done before by AbstractEntity.
>>>>
>>>> well, probably done before (dependent of parser settings)
>>>>
>>>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>>>> descriptors have their own parsing code.
>>>>
>>>> yes - the body descriptors support a wider variety of fields than the
>>>> Field stuff
>>>>
>>>>>  * we have unnecessary duplicate code and the two parsers are very
>>>>> likely inconsistent with each other.
>>>>
>>>> yes - i trust the code in AbstractEntity more highly then that in Field
>>>
>>> I'm not so sure.. For example AbstractEntity uses
>>> MimeUtil.getHeaderParams() to parse Content-Type and
>>> Content-Disposition. It looks like getHeaderParams does not correctly
>>> handle comments that are allowed in quoted-strings.. Consider this
>>> example:
>>>
>>>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>>>         Map<String, String> headerParams =
>>> MimeUtil.getHeaderParams(test);
>>>         System.out.println(headerParams);
>>>
>>> The result is {=text/plain, foo=(test), key=value} which is wrong.
>>
>> Comments are a real PITA - users of high level APIs don't want to have
>> to reparse values to strip comments. Most of this stuff was intended
>> to be called after full normalization. If we switch to fields and
>> JavaCC then it might be possible to support comment extraction which
>> might be useful one day.
>
> Unfortunately this full normalization does not happen anywhere so I
> think this is actually a bug. I'm not sure whether comment extraction
> would make much sense but comments definitely have to be ignored for
> now.
>
> Check it out for yourself: try parsing a message that has a a content
> type like "multipart/mixed; boundary=(comment)boundary" and watch
> Mime4j fail..

I'm pretty sure that there a number of other ways that the failure to
fully normalize causes issues - for example, I doubt unescaping
international characters works correctly.

>>> The Field equivalent..
>>>
>>>         ContentTypeField f = Fields.contentType(test);
>>>         System.out.println(f.getParameters());
>>>
>>> ..gives the correct result: {foo=bar, key=value}.
>>>
>>> But of course we could fix this and use MimeUtil.getHeaderParams() for
>>> parsing Fields, too. In fact I would not mind if most of the javacc
>>> parsers were replaced by handcrafted equivalents..
>>
>> I'm happy to use the JavaCC parsers provided that they are available
>> and debugged for all the required types.
>>
>>>
>>>>>  * parsing issues have to be addressed twice.
>>>>
>>>> to support the wider range of fields in the DOM, then yes
>>>>
>>>>> To resolve these problems I would like to propose a drastic change to
>>>>> the
>>>>> API:
>>>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>>>  * The body descriptor can extract information from the Field
>>>>> instances without need of their own parsing code
>>>>>  * A ContentHandler would be notified of a Field instead of a string
>>>>>  * MessageBuilder could simply store the field and would not have to
>>>>> parse it again.
>>>>>
>>>>> Please note that with the recent changes to the API all concrete Field
>>>>> classes parse the field value lazily so there should be no significant
>>>>> performance impact.
>>>>
>>>> i have no essential objections to the strategy though i think maybe
>>>> some discussion on tactics might be worthwhile
>>>>
>>>> the pull parsing stuff only stores a limited number of headers. for
>>>> any header which isn't of interest to the user, the field is only
>>>> parsed into header and value. this is about the minimum work required
>>>> to discover the header name and so whether it's of interest.
>>>>
>>>> the typical use case for not being interested in headers is
>>>> performance. so, though creating a field is a small hit, it would be
>>>> in keeping with the spirit of the API to add something like
>>>> isFieldSkippable (but with a better name) as well as modifying
>>>> addField to accept a field. then only when the descriptor is
>>>> interested in the result would the field need to be created. the SAX
>>>> handler would then take every field in full which is fine since it's a
>>>> higher level API.
>>>
>>> Sounds reasonable..
>>>
>>>> if this is the approach adopted then there are some improvements which
>>>> could be made to the structure of the descriptor classes. i should be
>>>> able to find time to help out with them.
>>>
>>> +1
>>>
>>> All in all I think this is something for 0.7..
>>
>> I'm getting a little worried that 0.7 is getting pushed further and
>> further, not closer. Mime4J is not mature enough to worry about API
>> compatibility so I wonder whether we'd be better to just look to cut
>> 0.7 soon and then follow up soon with a 0.8 featuring this redesign.
>
> I actually want to postpone this. Our next release is 0.6 ;-)

Good let's write up a JIRA and get 0.6 cut :-)

Robert

>
> Markus
>

Re: Duplicate parsing of some header fields

Posted by Markus Wiederkehr <ma...@gmail.com>.
On Tue, Feb 3, 2009 at 9:27 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> On 2/3/09, Markus Wiederkehr <ma...@gmail.com> wrote:
>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
>> <ro...@gmail.com> wrote:
>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>>> <ma...@gmail.com> wrote:
>>>> When Mime4j is used to build a DOM some header fields are parsed more
>>>> than once..
>>>
>>> ...i suspected that this might be the case...
>>>
>>>> This is what happens:
>>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>>> valid fields
>>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>>  * MaximalBodyDescriptor parses additional header fields such as
>>>> Mime-Version
>>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>>> AbstractEntity and notifies a ContentHandler
>>>>  * the ContentHandler (again) has to parse the raw string into name and
>>>> value
>>>>  * the ContentHandler (again) has to parse certain structural fields
>>>> already parsed by a body descriptor
>>>
>>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>>> and pull parsing downstream (in particular IMAP which uses a lot of
>>> MIME meta-data)
>>>
>>>> In case of building a DOM the latter two items are done by
>>>> MessageBuilder by calling Field.parse().
>>>>
>>>> There are several issues here:
>>>>  * the ContentHandler has to do a lot of work that has already been
>>>> done before by AbstractEntity.
>>>
>>> well, probably done before (dependent of parser settings)
>>>
>>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>>> descriptors have their own parsing code.
>>>
>>> yes - the body descriptors support a wider variety of fields than the
>>> Field stuff
>>>
>>>>  * we have unnecessary duplicate code and the two parsers are very
>>>> likely inconsistent with each other.
>>>
>>> yes - i trust the code in AbstractEntity more highly then that in Field
>>
>> I'm not so sure.. For example AbstractEntity uses
>> MimeUtil.getHeaderParams() to parse Content-Type and
>> Content-Disposition. It looks like getHeaderParams does not correctly
>> handle comments that are allowed in quoted-strings.. Consider this
>> example:
>>
>>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>>         System.out.println(headerParams);
>>
>> The result is {=text/plain, foo=(test), key=value} which is wrong.
>
> Comments are a real PITA - users of high level APIs don't want to have
> to reparse values to strip comments. Most of this stuff was intended
> to be called after full normalization. If we switch to fields and
> JavaCC then it might be possible to support comment extraction which
> might be useful one day.

Unfortunately this full normalization does not happen anywhere so I
think this is actually a bug. I'm not sure whether comment extraction
would make much sense but comments definitely have to be ignored for
now.

Check it out for yourself: try parsing a message that has a a content
type like "multipart/mixed; boundary=(comment)boundary" and watch
Mime4j fail..

>> The Field equivalent..
>>
>>         ContentTypeField f = Fields.contentType(test);
>>         System.out.println(f.getParameters());
>>
>> ..gives the correct result: {foo=bar, key=value}.
>>
>> But of course we could fix this and use MimeUtil.getHeaderParams() for
>> parsing Fields, too. In fact I would not mind if most of the javacc
>> parsers were replaced by handcrafted equivalents..
>
> I'm happy to use the JavaCC parsers provided that they are available
> and debugged for all the required types.
>
>>
>>>>  * parsing issues have to be addressed twice.
>>>
>>> to support the wider range of fields in the DOM, then yes
>>>
>>>> To resolve these problems I would like to propose a drastic change to the
>>>> API:
>>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>>  * The body descriptor can extract information from the Field
>>>> instances without need of their own parsing code
>>>>  * A ContentHandler would be notified of a Field instead of a string
>>>>  * MessageBuilder could simply store the field and would not have to
>>>> parse it again.
>>>>
>>>> Please note that with the recent changes to the API all concrete Field
>>>> classes parse the field value lazily so there should be no significant
>>>> performance impact.
>>>
>>> i have no essential objections to the strategy though i think maybe
>>> some discussion on tactics might be worthwhile
>>>
>>> the pull parsing stuff only stores a limited number of headers. for
>>> any header which isn't of interest to the user, the field is only
>>> parsed into header and value. this is about the minimum work required
>>> to discover the header name and so whether it's of interest.
>>>
>>> the typical use case for not being interested in headers is
>>> performance. so, though creating a field is a small hit, it would be
>>> in keeping with the spirit of the API to add something like
>>> isFieldSkippable (but with a better name) as well as modifying
>>> addField to accept a field. then only when the descriptor is
>>> interested in the result would the field need to be created. the SAX
>>> handler would then take every field in full which is fine since it's a
>>> higher level API.
>>
>> Sounds reasonable..
>>
>>> if this is the approach adopted then there are some improvements which
>>> could be made to the structure of the descriptor classes. i should be
>>> able to find time to help out with them.
>>
>> +1
>>
>> All in all I think this is something for 0.7..
>
> I'm getting a little worried that 0.7 is getting pushed further and
> further, not closer. Mime4J is not mature enough to worry about API
> compatibility so I wonder whether we'd be better to just look to cut
> 0.7 soon and then follow up soon with a 0.8 featuring this redesign.

I actually want to postpone this. Our next release is 0.6 ;-)

Markus

Re: Duplicate parsing of some header fields

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On 2/3/09, Markus Wiederkehr <ma...@gmail.com> wrote:
> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>> <ma...@gmail.com> wrote:
>>> When Mime4j is used to build a DOM some header fields are parsed more
>>> than once..
>>
>> ...i suspected that this might be the case...
>>
>>> This is what happens:
>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>> valid fields
>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>  * MaximalBodyDescriptor parses additional header fields such as
>>> Mime-Version
>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>> AbstractEntity and notifies a ContentHandler
>>>  * the ContentHandler (again) has to parse the raw string into name and
>>> value
>>>  * the ContentHandler (again) has to parse certain structural fields
>>> already parsed by a body descriptor
>>
>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>> and pull parsing downstream (in particular IMAP which uses a lot of
>> MIME meta-data)
>>
>>> In case of building a DOM the latter two items are done by
>>> MessageBuilder by calling Field.parse().
>>>
>>> There are several issues here:
>>>  * the ContentHandler has to do a lot of work that has already been
>>> done before by AbstractEntity.
>>
>> well, probably done before (dependent of parser settings)
>>
>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>> descriptors have their own parsing code.
>>
>> yes - the body descriptors support a wider variety of fields than the
>> Field stuff
>>
>>>  * we have unnecessary duplicate code and the two parsers are very
>>> likely inconsistent with each other.
>>
>> yes - i trust the code in AbstractEntity more highly then that in Field
>
> I'm not so sure.. For example AbstractEntity uses
> MimeUtil.getHeaderParams() to parse Content-Type and
> Content-Disposition. It looks like getHeaderParams does not correctly
> handle comments that are allowed in quoted-strings.. Consider this
> example:
>
>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>         System.out.println(headerParams);
>
> The result is {=text/plain, foo=(test), key=value} which is wrong.

Comments are a real PITA - users of high level APIs don't want to have
to reparse values to strip comments. Most of this stuff was intended
to be called after full normalization. If we switch to fields and
JavaCC then it might be possible to support comment extraction which
might be useful one day.

> The Field equivalent..
>
>         ContentTypeField f = Fields.contentType(test);
>         System.out.println(f.getParameters());
>
> ..gives the correct result: {foo=bar, key=value}.
>
> But of course we could fix this and use MimeUtil.getHeaderParams() for
> parsing Fields, too. In fact I would not mind if most of the javacc
> parsers were replaced by handcrafted equivalents..

I'm happy to use the JavaCC parsers provided that they are available
and debugged for all the required types.

>
>>>  * parsing issues have to be addressed twice.
>>
>> to support the wider range of fields in the DOM, then yes
>>
>>> To resolve these problems I would like to propose a drastic change to the
>>> API:
>>>  * AbstractEntity should use Field.parse to parse the raw field
>>>  * The body descriptor can extract information from the Field
>>> instances without need of their own parsing code
>>>  * A ContentHandler would be notified of a Field instead of a string
>>>  * MessageBuilder could simply store the field and would not have to
>>> parse it again.
>>>
>>> Please note that with the recent changes to the API all concrete Field
>>> classes parse the field value lazily so there should be no significant
>>> performance impact.
>>
>> i have no essential objections to the strategy though i think maybe
>> some discussion on tactics might be worthwhile
>>
>> the pull parsing stuff only stores a limited number of headers. for
>> any header which isn't of interest to the user, the field is only
>> parsed into header and value. this is about the minimum work required
>> to discover the header name and so whether it's of interest.
>>
>> the typical use case for not being interested in headers is
>> performance. so, though creating a field is a small hit, it would be
>> in keeping with the spirit of the API to add something like
>> isFieldSkippable (but with a better name) as well as modifying
>> addField to accept a field. then only when the descriptor is
>> interested in the result would the field need to be created. the SAX
>> handler would then take every field in full which is fine since it's a
>> higher level API.
>
> Sounds reasonable..
>
>> if this is the approach adopted then there are some improvements which
>> could be made to the structure of the descriptor classes. i should be
>> able to find time to help out with them.
>
> +1
>
> All in all I think this is something for 0.7..

I'm getting a little worried that 0.7 is getting pushed further and
further, not closer. Mime4J is not mature enough to worry about API
compatibility so I wonder whether we'd be better to just look to cut
0.7 soon and then follow up soon with a 0.8 featuring this redesign.

- Robert

Re: Duplicate parsing of some header fields

Posted by Robert Burrell Donkin <ro...@gmail.com>.
On Wed, Feb 4, 2009 at 1:26 PM, Oleg Kalnichevski <ol...@apache.org> wrote:
> Markus Wiederkehr wrote:
>>
>> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
>> <ro...@gmail.com> wrote:
>>>
>>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>>> <ma...@gmail.com> wrote:
>>>>
>>>> When Mime4j is used to build a DOM some header fields are parsed more
>>>> than once..
>>>
>>> ...i suspected that this might be the case...
>>>
>>>> This is what happens:
>>>>  * AbstractEntity.parseField parses the raw string into a name-value
>>>> pair
>>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>>> valid fields
>>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>>  * MaximalBodyDescriptor parses additional header fields such as
>>>> Mime-Version
>>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>>> AbstractEntity and notifies a ContentHandler
>>>>  * the ContentHandler (again) has to parse the raw string into name and
>>>> value
>>>>  * the ContentHandler (again) has to parse certain structural fields
>>>> already parsed by a body descriptor
>>>
>>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>>> and pull parsing downstream (in particular IMAP which uses a lot of
>>> MIME meta-data)
>>>
>>>> In case of building a DOM the latter two items are done by
>>>> MessageBuilder by calling Field.parse().
>>>>
>>>> There are several issues here:
>>>>  * the ContentHandler has to do a lot of work that has already been
>>>> done before by AbstractEntity.
>>>
>>> well, probably done before (dependent of parser settings)
>>>
>>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>>> descriptors have their own parsing code.
>>>
>>> yes - the body descriptors support a wider variety of fields than the
>>> Field stuff
>>>
>>>>  * we have unnecessary duplicate code and the two parsers are very
>>>> likely inconsistent with each other.
>>>
>>> yes - i trust the code in AbstractEntity more highly then that in Field
>>
>> I'm not so sure.. For example AbstractEntity uses
>> MimeUtil.getHeaderParams() to parse Content-Type and
>> Content-Disposition. It looks like getHeaderParams does not correctly
>> handle comments that are allowed in quoted-strings.. Consider this
>> example:
>>
>>        String test = "text/plain; key=value; foo= (test) \"bar\"";
>>        Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>>        System.out.println(headerParams);
>>
>> The result is {=text/plain, foo=(test), key=value} which is wrong. The
>> Field equivalent..
>>
>>        ContentTypeField f = Fields.contentType(test);
>>        System.out.println(f.getParameters());
>>
>> ..gives the correct result: {foo=bar, key=value}.
>>
>> But of course we could fix this and use MimeUtil.getHeaderParams() for
>> parsing Fields, too. In fact I would not mind if most of the javacc
>> parsers were replaced by handcrafted equivalents..
>>
>
> Folks
>
> For what is it worth to you. HttpComponents Core provides an extensive
> support for parsing and formatting of MIME headers. The code has been around
> for quite some time, has been relatively well tested, has been extensively
> optimized for performance and memory footprint and is able to operate with
> virtually zero intermediate garbage and no intermediate memory copying.
>
> The parsing/formatting framework is based on the CharArrayBuffer class,
> which is used by MimeTokenStream for low level parsing operations.
>
> http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueParser.java
> http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueParser.java
> http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueFormatter.java
> http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueFormatter.java
>
> Feel free to take a look at the code and see if it might make sense
> replacing MimeUtil.getHeaderParams(value) with some bits from HC.

sounds good - it's good to see that we have quite a few options for
fix this properly in 0.7

- robert

Re: Duplicate parsing of some header fields

Posted by Oleg Kalnichevski <ol...@apache.org>.
Markus Wiederkehr wrote:
> On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
> <ro...@gmail.com> wrote:
>> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
>> <ma...@gmail.com> wrote:
>>> When Mime4j is used to build a DOM some header fields are parsed more
>>> than once..
>> ...i suspected that this might be the case...
>>
>>> This is what happens:
>>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>>> valid fields
>>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>>>  * eventually MimeStreamParser.parse retrieves the raw field from
>>> AbstractEntity and notifies a ContentHandler
>>>  * the ContentHandler (again) has to parse the raw string into name and value
>>>  * the ContentHandler (again) has to parse certain structural fields
>>> already parsed by a body descriptor
>> IIRC the descriptor stuff was added to satisfy requirements for SAX
>> and pull parsing downstream (in particular IMAP which uses a lot of
>> MIME meta-data)
>>
>>> In case of building a DOM the latter two items are done by
>>> MessageBuilder by calling Field.parse().
>>>
>>> There are several issues here:
>>>  * the ContentHandler has to do a lot of work that has already been
>>> done before by AbstractEntity.
>> well, probably done before (dependent of parser settings)
>>
>>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>>> descriptors have their own parsing code.
>> yes - the body descriptors support a wider variety of fields than the
>> Field stuff
>>
>>>  * we have unnecessary duplicate code and the two parsers are very
>>> likely inconsistent with each other.
>> yes - i trust the code in AbstractEntity more highly then that in Field
> 
> I'm not so sure.. For example AbstractEntity uses
> MimeUtil.getHeaderParams() to parse Content-Type and
> Content-Disposition. It looks like getHeaderParams does not correctly
> handle comments that are allowed in quoted-strings.. Consider this
> example:
> 
>         String test = "text/plain; key=value; foo= (test) \"bar\"";
>         Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
>         System.out.println(headerParams);
> 
> The result is {=text/plain, foo=(test), key=value} which is wrong. The
> Field equivalent..
> 
>         ContentTypeField f = Fields.contentType(test);
>         System.out.println(f.getParameters());
> 
> ..gives the correct result: {foo=bar, key=value}.
> 
> But of course we could fix this and use MimeUtil.getHeaderParams() for
> parsing Fields, too. In fact I would not mind if most of the javacc
> parsers were replaced by handcrafted equivalents..
> 

Folks

For what is it worth to you. HttpComponents Core provides an extensive 
support for parsing and formatting of MIME headers. The code has been 
around for quite some time, has been relatively well tested, has been 
extensively optimized for performance and memory footprint and is able 
to operate with virtually zero intermediate garbage and no intermediate 
memory copying.

The parsing/formatting framework is based on the CharArrayBuffer class, 
which is used by MimeTokenStream for low level parsing operations.

http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueParser.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/HeaderValueFormatter.java
http://svn.apache.org/repos/asf/httpcomponents/httpcore/trunk/module-main/src/main/java/org/apache/http/message/BasicHeaderValueFormatter.java

Feel free to take a look at the code and see if it might make sense 
replacing MimeUtil.getHeaderParams(value) with some bits from HC.

I fully admit my personal bias, so feel free to ignore me.

Oleg



Re: Duplicate parsing of some header fields

Posted by Markus Wiederkehr <ma...@gmail.com>.
On Sun, Feb 1, 2009 at 10:34 PM, Robert Burrell Donkin
<ro...@gmail.com> wrote:
> On Fri, Jan 30, 2009 at 7:37 PM, Markus Wiederkehr
> <ma...@gmail.com> wrote:
>> When Mime4j is used to build a DOM some header fields are parsed more
>> than once..
>
> ...i suspected that this might be the case...
>
>> This is what happens:
>>  * AbstractEntity.parseField parses the raw string into a name-value pair
>>  * AbstractEntity.parseField calls MutableBodyDescriptor.addField for
>> valid fields
>>  * DefaultBodyDescriptor parses header fields such as Content-Type
>>  * MaximalBodyDescriptor parses additional header fields such as Mime-Version
>>  * eventually MimeStreamParser.parse retrieves the raw field from
>> AbstractEntity and notifies a ContentHandler
>>  * the ContentHandler (again) has to parse the raw string into name and value
>>  * the ContentHandler (again) has to parse certain structural fields
>> already parsed by a body descriptor
>
> IIRC the descriptor stuff was added to satisfy requirements for SAX
> and pull parsing downstream (in particular IMAP which uses a lot of
> MIME meta-data)
>
>> In case of building a DOM the latter two items are done by
>> MessageBuilder by calling Field.parse().
>>
>> There are several issues here:
>>  * the ContentHandler has to do a lot of work that has already been
>> done before by AbstractEntity.
>
> well, probably done before (dependent of parser settings)
>
>>  * Field.parse() uses a JavaCC-generated parser whereas the body
>> descriptors have their own parsing code.
>
> yes - the body descriptors support a wider variety of fields than the
> Field stuff
>
>>  * we have unnecessary duplicate code and the two parsers are very
>> likely inconsistent with each other.
>
> yes - i trust the code in AbstractEntity more highly then that in Field

I'm not so sure.. For example AbstractEntity uses
MimeUtil.getHeaderParams() to parse Content-Type and
Content-Disposition. It looks like getHeaderParams does not correctly
handle comments that are allowed in quoted-strings.. Consider this
example:

        String test = "text/plain; key=value; foo= (test) \"bar\"";
        Map<String, String> headerParams = MimeUtil.getHeaderParams(test);
        System.out.println(headerParams);

The result is {=text/plain, foo=(test), key=value} which is wrong. The
Field equivalent..

        ContentTypeField f = Fields.contentType(test);
        System.out.println(f.getParameters());

..gives the correct result: {foo=bar, key=value}.

But of course we could fix this and use MimeUtil.getHeaderParams() for
parsing Fields, too. In fact I would not mind if most of the javacc
parsers were replaced by handcrafted equivalents..

>>  * parsing issues have to be addressed twice.
>
> to support the wider range of fields in the DOM, then yes
>
>> To resolve these problems I would like to propose a drastic change to the API:
>>  * AbstractEntity should use Field.parse to parse the raw field
>>  * The body descriptor can extract information from the Field
>> instances without need of their own parsing code
>>  * A ContentHandler would be notified of a Field instead of a string
>>  * MessageBuilder could simply store the field and would not have to
>> parse it again.
>>
>> Please note that with the recent changes to the API all concrete Field
>> classes parse the field value lazily so there should be no significant
>> performance impact.
>
> i have no essential objections to the strategy though i think maybe
> some discussion on tactics might be worthwhile
>
> the pull parsing stuff only stores a limited number of headers. for
> any header which isn't of interest to the user, the field is only
> parsed into header and value. this is about the minimum work required
> to discover the header name and so whether it's of interest.
>
> the typical use case for not being interested in headers is
> performance. so, though creating a field is a small hit, it would be
> in keeping with the spirit of the API to add something like
> isFieldSkippable (but with a better name) as well as modifying
> addField to accept a field. then only when the descriptor is
> interested in the result would the field need to be created. the SAX
> handler would then take every field in full which is fine since it's a
> higher level API.

Sounds reasonable..

> if this is the approach adopted then there are some improvements which
> could be made to the structure of the descriptor classes. i should be
> able to find time to help out with them.

+1

All in all I think this is something for 0.7..

Markus