Posted to mime4j-dev@james.apache.org by Wolfgang Fahl <wf...@bitplan.com> on 2014/10/02 19:37:39 UTC

Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Hi Oleg,

Thank you for your prompt answer and advice. It worked beautifully from
my point of view:
https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java
has a fix that follows my proposal to make lenient Charset handling the
default, or at least switchable

and

https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/test/java/org/apache/james/mime4j/dom/MessageCharsetLenientTest.java
has a JUnit test that shows the modified behavior. It includes some 50
invalid Charsets I found in my sample of 1/4 million e-mail messages.
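
To illustrate the idea of a switchable lenient mode, here is a minimal
sketch that uses only java.nio.charset. It is not the code in
BasicBodyFactory; the class name LenientCharsets, the method resolve and
its parameters are hypothetical names chosen for this example. The
assumption is that "lenient" means falling back to a caller-supplied
default Charset whenever the declared name is illegal or unsupported.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of mime4j.
final class LenientCharsets {

    private LenientCharsets() {
    }

    /**
     * Resolves a charset name taken from a MIME header.
     *
     * @param mimeCharset the declared name, may be null if none was declared
     * @param fallback    the Charset to use when the name cannot be resolved
     * @param lenient     if false, invalid names cause an exception instead
     */
    static Charset resolve(String mimeCharset, Charset fallback, boolean lenient) {
        if (mimeCharset == null) {
            // Nothing declared: there is no invalid name to complain about.
            return fallback;
        }
        try {
            return Charset.forName(mimeCharset.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            if (!lenient) {
                throw e; // strict mode: surface the problem to the caller
            }
            return fallback; // lenient mode: keep parsing with the default
        }
    }
}
```

With lenient set to true, a call such as resolve("x-no-such-charset",
StandardCharsets.US_ASCII, true) returns the fallback; with lenient set to
false the same call throws an IllegalArgumentException subclass.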

As far as I can tell the changes don't break any other test.

The relevant bug https://issues.apache.org/jira/browse/MIME4J-218
is still marked as resolved. Shall I add a new one or are you going to
reopen it?

With the fix above, only 5 messages in my sample of 1/4 million emails
can't be parsed by mime4j 0.8.0-SNAPSHOT. All remaining errors are due to
line size and header size limits. The repository above has further
improvements to the handling of these MimeConfig settings. There seems
to be a follow-up problem: the MimeConfig settings are not fully picked
up in all situations. This becomes visible when the exception message
includes the current setting, as it does with my changes.

Example:
mail<4E...@dia-bonn.de>.err:org.apache.james.mime4j.MimeIOException:
Maximum header length limit (20000) exceeded

E.g. when setting the maxheaderlines parameter to 20000 there are still
situations in which an exception is thrown with the maxheaderlines
parameter being 1000. So it seems that the config is not used consistently
but is replaced by the default in certain circumstances, which I'd still
have to debug. Is this worth another bug report?
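
One way to triage such cases is to read the limit back out of the
exception message and compare it with the value that was configured. The
following is only a sketch: it assumes the message format shown in the
example above (which comes from my modified version, where the current
setting is appended in parentheses), and the class LimitInMessage and its
methods are hypothetical helpers, not mime4j API.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical triage helper, not part of mime4j.
final class LimitInMessage {

    // Matches e.g. "Maximum header length limit (20000) exceeded".
    // The "(N)" part is assumed to be present, as in the patched messages.
    private static final Pattern LIMIT =
            Pattern.compile("limit \\((\\d{1,9})\\) exceeded");

    private LimitInMessage() {
    }

    /** Returns the limit quoted in the message, if there is one. */
    static OptionalInt reportedLimit(String message) {
        Matcher m = LIMIT.matcher(message);
        if (m.find()) {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        }
        return OptionalInt.empty();
    }

    /**
     * True if the message quotes a limit that differs from the configured
     * one, i.e. some code path did not use the supplied configuration.
     */
    static boolean configIgnored(String message, int configuredLimit) {
        OptionalInt reported = reportedLimit(message);
        return reported.isPresent() && reported.getAsInt() != configuredLimit;
    }
}
```

Running configIgnored over the collected .err files with the configured
value (20000 in my case) would single out the messages where a different
limit, such as the default, was in effect.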

Cheers

Wolfgang

Am 29.09.14 um 16:27 schrieb Oleg Kalnichevski:
> On Mon, 2014-09-29 at 15:59 +0200, Wolfgang Fahl wrote:
>> Hi Eric, Ioan, Oleg and others,
>>
> ...
>
>> Now I was hoping to be able to test this fix. I assume I have to add
>> some test message to:
>> core:
>>    src/test/resources/testmsgs
>>
>> But to really check the new behaviour they'd have to be three different
>> tests:
>> 1. check invalid mimeCharset in lenient mode - will work with default
>> Charset
>> 2. check invalid mimeCharset in non-lenient mode - will throw exception
>> 3. check invalid mimeCharset in non-lenient mode with overridden
>> resolveCharset - will work with chosen mapped Charset.
>>
> A plain vanilla JUnit will do.
>
>> Please let me know how I can add these tests and how get a proper
>> patchset going. I don't work much with Subversion these days;
>> I prefer to use git.
>>
> You are welcome to open a PR at github and reference it from JIRA
>
> https://github.com/apache/james-mime4j
>
> Oleg
>
>> Cheers
>>
>> Wolfgang
>>
>> Am 10.08.14 um 10:33 schrieb Stan Ioan Eugen:
>>> Hello Wolfgang,
>>>
>>> Sorry for my late reply.  I've created a Jira ticket to track this
>>> issue. As Eric suggested, it's the right way to get code into the
>>> project.
>>> I've looked over the code and it looks good in general. I would keep
>>> both variants of the regular expression to match FROM lines, with  a
>>> good  javadoc, so users can use any of them in their code. I would
>>> also move the 'mbox != null' check inside the constructor - this way
>>> we make sure we don't create an object in an inconsistent state.
>>>
>>> I will be more than happy to push the patch upstream once we have some
>>> tests for the new behavior. Are you interested in providing the tests?
>>>
>>> Please use the issue for patch submission and relevant comments.
>>> https://issues.apache.org/jira/browse/MIME4J-242
>>>
>>> Thanks,
>>>
>>>
>>> 2014-08-03 10:52 GMT+03:00 Eric Charles <er...@apache.org>:
>>>> Could you open an issue in JIRA at https://issues.apache.org/jira/browse/MIME4J
>>>> and upload your patch there? Thx.
>>>>
>>>> On 07/23/2014 09:57 AM, Wolfgang Fahl wrote:
>>>>> Hi Ioan Eugen,
>>>>>
>>>>> please find attached a patch.
>>>>>
>>>>> it uses the following fromline pattern:
>>>>> static final String DEFAULT = "^From \\S+.*\\d{4}$";
>>>>> so that it matches more lines.
>>>>> 1. From ieugen@apache.org Fri Sep 09 14:04:52 2011
>>>>> 2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
>>>>> 3. From - Wed Apr 02 06:51:08 2014
>>>>>
>>>>> so looking for an "@" sign is not enforced any more.
>>>>>
>>>>> The patch fixes a typo:
>>>>> -    private Matcher fromLineMathcer;
>>>>> +    private Matcher fromLineMatcher;
>>>>>
>>>>> in many places of the source code.
>>>>>
>>>>> It adds a reference to the original mbox File so that the error message:
>>>>> +                 if (mbox!=null)
>>>>> +                       path=mbox.getPath();
>>>>> +            throw new IllegalArgumentException("File "+path+" does not
>>>>> contain From_ lines that match the pattern
>>>>> '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");
>>>>>
>>>>> can be improved.
>>>>>
>>>>> Who is going to check this patch and what needs to be done to get it
>>>>> into the official repo?
>>>>> I would also like to add more test cases and especially include some
>>>>> dummy mboxes. And as mentioned I'd like to check the iterator against
>>>>> all my Thunderbird mboxes to check
>>>>> whether it will successfully parse them all. Also I am offering to write
>>>>> a few "tutorial lines". Where would I have to put these?
>>>>>
>>>>> Cheers
>>>>>   Wolfgang
>>>>>
>>>>> Am 22.07.14 22:23, schrieb Ioan Eugen Stan:
>>>>>> Hello Wolfgang,
>>>>>>
>>>>>> I developed MailboxIterator. It's nice to see that it's helpful :)
>>>>>>
>>>>>> You get that error because MboxIterator does not know how to split the
>>>>>> messages. Messages in an mbox file are separated via lines that start
>>>>>> with 'From ' (the word From followed by a space). They are called (by me at least) 'From lines' :) .
>>>>>> One problem with the mbox format is that it's a bit 'free-form' in the
>>>>>> sense that developers abused it and we have some variants [1].
>>>>>>
>>>>>> One thing that you could try is to supply a different From line
>>>>>> regular expression to MboxIterator via regexpPattern argument. It will
>>>>>> split messages based on this new value.
>>>>>>
>>>>>> [1] http://wiki2.dovecot.org/MailboxFormat/mbox
>>>>>>
>>>>>> Good luck and please post your results.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> On Fri, Jul 18, 2014 at 12:53 PM, Wolfgang Fahl <wf...@bitplan.com> wrote:
>>>>>>> Dear mime4j developers,
>>>>>>>
>>>>>>> for one of my projects I have been using mime4j successfully to import
>>>>>>> e-mail into our CRM database for some two years now.
>>>>>>> Currently I am trying to add a feature which would allow reading Mozilla
>>>>>>> Thunderbird Mailbox content.
>>>>>>> As of mime4j 0.8 there seems to be a MboxIterator which could do that.
>>>>>>> Since I didn't find any publicly available source repository which I
>>>>>>> could use to access the 0.8-SNAPSHOT I have copied
>>>>>>> the three source files:
>>>>>>> * CharBufferWrapper.java
>>>>>>> * FromLinePatterns.java
>>>>>>> * MboxIterator.java
>>>>>>>
>>>>>>> into my source tree and I am using these together with the following
>>>>>>> maven dependency:
>>>>>>>
>>>>>>> <!-- EMail handling -->
>>>>>>>         <dependency>
>>>>>>>             <groupId>org.apache.james</groupId>
>>>>>>>             <artifactId>apache-mime4j-core</artifactId>
>>>>>>>             <version>0.7.2</version>
>>>>>>>         </dependency>
>>>>>>>         <dependency>
>>>>>>>             <groupId>org.apache.james</groupId>
>>>>>>>             <artifactId>apache-mime4j-dom</artifactId>
>>>>>>>             <version>0.7.2</version>
>>>>>>>         </dependency>
>>>>>>>
>>>>>>> The iterator works somewhat o.k. on some of the Thunderbird mailbox
>>>>>>> files and loops thru the mails in it correctly.
>>>>>>> The mails can then not be parsed directly with mime4j - there is one
>>>>>>> newline at the beginning which spoils the show. After
>>>>>>> working around this it's working as expected in some cases. In other
>>>>>>> cases there is an error:
>>>>>>>
>>>>>>> java.lang.IllegalArgumentException: File does not contain From_ lines!
>>>>>>> Maybe not be a vaild Mbox.
>>>>>>>     at
>>>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.initMboxIterator(MboxIterator.java:85)
>>>>>>>     at
>>>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:75)
>>>>>>>     at
>>>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:62)
>>>>>>>     at
>>>>>>> org.apache.james.mime4j.mboxiterator.MboxIterator$Builder.build(MboxIterator.java:241)
>>>>>>>     at
>>>>>>> com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:386)
>>>>>>>     at
>>>>>>> com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:261)
>>>>>>>     at
>>>>>>> com.bitplan.clientutils.rest.TestMailAccess.testMailById(TestMailAccess.java:77)
>>>>>>>
>>>>>>> By the way - there is a typo in the above error message "vaild" should
>>>>>>> be "valid".
>>>>>>>
>>>>>>> The error is something I'd like to fix or work-around.
>>>>>>>
>>>>>>> I have two big user accounts with several hundred mailbox files and some
>>>>>>> 300.000 mails from the last 15 years which I'd like
>>>>>>> to use as a testcase against which to run the mime4j implementation.
>>>>>>>
>>>>>>> Would you please supply me with some pointers where I get the necessary
>>>>>>> source code and how I could supply patches and
>>>>>>> testcases for the project?
>>>>>>>
>>>>>>> Also it would be good to know whether others would be interested in the
>>>>>>> Thunderbird Mailbox reading capability.
>>>>>>>
>>>>>>>
>>>>>>> Cheers
>>>>>>>   Wolfgang
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> BITPlan - smart solutions
>>>>>>> Wolfgang Fahl
>>>>>>> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
>>>>>>> Tel. +49 2154 811-480, Fax +49 2154 811-481
>>>>>>> Web: http://www.bitplan.de
>>>>>>> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl
>>>>>>>
>>>
>
>

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 



Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Oleg,

Thanks. I tried this, and it looks like my pull request magically made it
into the comments
for https://issues.apache.org/jira/browse/MIME4J-218

Cheers
  Wolfgang
Am 03.10.14 um 10:31 schrieb Oleg Kalnichevski:
> On Thu, 2014-10-02 at 19:37 +0200, Wolfgang Fahl wrote:
>> Hi Oleg,
>>
>> Thank you for your prompt answer and advice. It worked beautifully from
>> my point of view:
>> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java
>> has a fix that follows my proposal to make lenient Charset handling the
>> default, or at least switchable
>>
>> and
>>
>> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/test/java/org/apache/james/mime4j/dom/MessageCharsetLenientTest.java
>> has a JUnit test that shows the modified behavior. It includes some 50
>> invalid Charsets I found in my sample of 1/4 million e-mail messages.
>>
>> As far as I can tell the changes don't break any other test.
>>
> Please raise a pull request in GitHub and post a link to MIME4J-218
>
>> The relevant bug https://issues.apache.org/jira/browse/MIME4J-218
>> is still marked as resolved. Shall I add a new one or are you going to
>> reopen it?
>>
> I re-opened MIME4J-218.
>
>> With the fix above, only 5 messages in my sample of 1/4 million emails
>> can't be parsed by mime4j 0.8.0-SNAPSHOT. All remaining errors are due to
>> line size and header size limits. The repository above has further
>> improvements to the handling of these MimeConfig settings. There seems
>> to be a follow-up problem: the MimeConfig settings are not fully picked
>> up in all situations. This becomes visible when the exception message
>> includes the current setting, as it does with my changes.
>>
>> Example:
>> mail<4E...@dia-bonn.de>.err:org.apache.james.mime4j.MimeIOException:
>> Maximum header length limit (20000) exceeded
>>
>> E.g. when setting the maxheaderlines parameter to 20000 there are still
>> situations in which an exception is thrown with the maxheaderlines
>> parameter being 1000. So it seems that the config is not used consistently
>> but is replaced by the default in certain circumstances, which I'd still
>> have to debug. Is this worth another bug report?
>>
> If you have a reasonable reproducer please raise a JIRA and attach the
> test case to it.
>
> Oleg
>
>
>

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 


Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2014-10-02 at 19:37 +0200, Wolfgang Fahl wrote:
> Hi Oleg,
> 
> Thank you for your prompt answer and advice. It worked beautifully from
> my point of view:
> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java
> has a fix that follows my proposal to make lenient Charset handling the
> default, or at least switchable
> 
> and
> 
> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/test/java/org/apache/james/mime4j/dom/MessageCharsetLenientTest.java
> has a JUnit test that shows the modified behavior. It includes some 50
> invalid Charsets I found in my sample of 1/4 million e-mail messages.
> 
> As far as I can tell the changes don't break any other test.
> 

Please raise a pull request in GitHub and post a link to MIME4J-218

> The relevant bug https://issues.apache.org/jira/browse/MIME4J-218
> is still marked as resolved. Shall I add a new one or are you going to
> reopen it?
> 

I re-opened MIME4J-218.

> With the fix above, only 5 messages in my sample of 1/4 million emails
> can't be parsed by mime4j 0.8.0-SNAPSHOT. All remaining errors are due to
> line size and header size limits. The repository above has further
> improvements to the handling of these MimeConfig settings. There seems
> to be a follow-up problem: the MimeConfig settings are not fully picked
> up in all situations. This becomes visible when the exception message
> includes the current setting, as it does with my changes.
> 
> Example:
> mail<4E...@dia-bonn.de>.err:org.apache.james.mime4j.MimeIOException:
> Maximum header length limit (20000) exceeded
> 
> E.g. when setting the maxheaderlines parameter to 20000 there are still
> situations in which an exception is thrown with the maxheaderlines
> parameter being 1000. So it seems that the config is not used consistently
> but is replaced by the default in certain circumstances, which I'd still
> have to debug. Is this worth another bug report?
> 

If you have a reasonable reproducer please raise a JIRA and attach the
test case to it.

Oleg