Posted to mime4j-dev@james.apache.org by Wolfgang Fahl <wf...@bitplan.com> on 2014/07/18 11:53:05 UTC

Thunderbird Mailbox support

Dear mime4j developers,

For one of my projects I have been using mime4j successfully to import
e-mail into our CRM database for some two years now.
Currently I am trying to add a feature which would allow reading Mozilla
Thunderbird mailbox content.
As of mime4j 0.8 there seems to be an MboxIterator which could do that.
Since I didn't find any publicly available source repository which I
could use to access the 0.8 snapshot, I have copied
the three source files:
* CharBufferWrapper.java
* FromLinePatterns.java
* MboxIterator.java

into my source tree and I am using these together with the following
Maven dependencies:

        <!-- EMail handling -->
        <dependency>
            <groupId>org.apache.james</groupId>
            <artifactId>apache-mime4j-core</artifactId>
            <version>0.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.james</groupId>
            <artifactId>apache-mime4j-dom</artifactId>
            <version>0.7.2</version>
        </dependency>

The iterator works reasonably well on some of the Thunderbird mailbox
files and loops through the mails in them correctly.
The mails can then not be parsed directly with mime4j: there is one
newline at the beginning which spoils the show. After
working around this it works as expected in some cases. In other
cases there is an error:

java.lang.IllegalArgumentException: File does not contain From_ lines! Maybe not be a vaild Mbox.
    at org.apache.james.mime4j.mboxiterator.MboxIterator.initMboxIterator(MboxIterator.java:85)
    at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:75)
    at org.apache.james.mime4j.mboxiterator.MboxIterator.<init>(MboxIterator.java:62)
    at org.apache.james.mime4j.mboxiterator.MboxIterator$Builder.build(MboxIterator.java:241)
    at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:386)
    at com.bitplan.clientutils.ThunderbirdMailArchiveImpl.getMailById(ThunderbirdMailArchiveImpl.java:261)
    at com.bitplan.clientutils.rest.TestMailAccess.testMailById(TestMailAccess.java:77)
 
By the way, there is a typo in the above error message: "vaild" should
be "valid".

The error is something I'd like to fix or work-around.
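
One possible workaround for the leading newline mentioned above, sketched
with the JDK only: skip any line breaks at the start of a message stream
before handing it to the parser. The class and method names below are
hypothetical and not part of mime4j, and how the stream is obtained from
the iterator is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of mime4j.
final class LeadingNewlineSkipper {

    // Returns a stream positioned at the first byte that is neither CR nor LF.
    static InputStream skipLeadingLineBreaks(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(in, 1);
        int b = pushback.read();
        while (b == '\r' || b == '\n') {
            b = pushback.read();
        }
        if (b != -1) {
            pushback.unread(b); // put the first real byte back
        }
        return pushback;
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = "\nSubject: test\r\n\r\nbody".getBytes(StandardCharsets.US_ASCII);
        InputStream cleaned = skipLeadingLineBreaks(new ByteArrayInputStream(raw));
        // The next byte read is the first byte of the header block.
        System.out.println((char) cleaned.read());
    }
}
```

The returned stream can then be passed to whatever parser entry point is
in use; only line breaks before the first header byte are dropped.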

I have two big user accounts with several hundred mailbox files and some
300,000 mails from the last 15 years which I'd like
to use as a test case against which to run the mime4j implementation.

Would you please give me some pointers on where to get the necessary
source code and how I could supply patches and
test cases for the project?

Also it would be good to know whether others would be interested in the
Thunderbird Mailbox reading capability.


Cheers
  Wolfgang

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 


Re: Thunderbird Mailbox support

Posted by Oleg Kalnichevski <ol...@apache.org>.
Wolfgang

I am not really involved in the development of MboxIterator, but generally
you should be able to find the sources in the ASF source repository [1] or
on GitHub [2] (a read-only copy of the official repo).

Once you have a change set which you would like incorporated into the
official code tree, you should raise a change request in JIRA [3] and
attach the patch to it or reference a pull request on GitHub.

Oleg

[1] http://svn.apache.org/repos/asf/james/mime4j/trunk/
[2] https://github.com/apache/james-mime4j/tree/trunk
[3] https://issues.apache.org/jira/browse/MIME4J


Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Oleg,

Thanks. I tried this, and it looks like my pull request magically made it
into the comments
for https://issues.apache.org/jira/browse/MIME4J-218

Cheers
  Wolfgang


Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Thu, 2014-10-02 at 19:37 +0200, Wolfgang Fahl wrote:
> Hi Oleg,
> 
> Thank you for your prompt answer and advice. It worked beautifully from
> my point of view:
> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java
> has a fix with my proposal to make lenient Charset handling the
> default while keeping it switchable
> 
> and
> 
> https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/test/java/org/apache/james/mime4j/dom/MessageCharsetLenientTest.java
> has a JUnit test that shows the modified behavior. It includes some 50
> invalid Charsets I found in my sample of 1/4 million e-mail messages.
> 
> As far as I can tell the changes don't break any other test.
> 

Please raise a pull request in GitHub and post a link to MIME4J-218

> The relevant bug https://issues.apache.org/jira/browse/MIME4J-218
> is still marked as resolved. Shall I add a new one or are you going to
> reopen it?
> 

I re-opened MIME4J-218.

> With the fix above only 5 messages in my sample of 1/4 million emails
> can't be parsed by mime4j 0.8.0-SNAPSHOT. All errors are due to
> line size and header size issues. The repository above has further
> improvements to the handling of these MimeConfig settings. There seems
> to be a follow-up problem: the MimeConfig settings are not fully picked
> up in all situations. This becomes visible when the exception message
> includes the current setting, as it does with my changes.
> 
> Example:
> mail<4E...@dia-bonn.de>.err:org.apache.james.mime4j.MimeIOException:
> Maximum header length limit (20000) exceeded
> 
> E.g. when setting the maxheaderlines parameter to 20000 there are still
> situations when an exception is thrown with the maxheaderlines parameter
> being 1000. So it seems that the config is not used consistently but
> replaced by the default in certain circumstances, which I'd still have to
> debug. Is this worth another bug report?
> 

If you have a reasonable reproducer please raise a JIRA and attach the
test case to it.

Oleg



Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Oleg,

Thank you for your prompt answer and advice. It worked beautifully from
my point of view:
https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java
has a fix with my proposal to make lenient Charset handling the
default while keeping it switchable

and

https://github.com/WolfgangFahl/james-mime4j/blob/trunk/dom/src/test/java/org/apache/james/mime4j/dom/MessageCharsetLenientTest.java
has a JUnit test that shows the modified behavior. It includes some 50
invalid Charsets I found in my sample of 1/4 million e-mail messages.

As far as I can tell the changes don't break any other test.

The relevant bug https://issues.apache.org/jira/browse/MIME4J-218
is still marked as resolved. Shall I add a new one or are you going to
reopen it?

With the fix above only 5 messages in my sample of 1/4 million emails
can't be parsed by mime4j 0.8.0-SNAPSHOT. All errors are due to
line size and header size issues. The repository above has further
improvements to the handling of these MimeConfig settings. There seems
to be a follow-up problem: the MimeConfig settings are not fully picked
up in all situations. This becomes visible when the exception message
includes the current setting, as it does with my changes.

Example:
mail<4E...@dia-bonn.de>.err:org.apache.james.mime4j.MimeIOException:
Maximum header length limit (20000) exceeded

E.g. when setting the maxheaderlines parameter to 20000 there are still
situations when an exception is thrown with the maxheaderlines parameter
being 1000. So it seems that the config is not used consistently but
replaced by the default in certain circumstances, which I'd still have to
debug. Is this worth another bug report?

Cheers

Wolfgang

On 29.09.14 at 16:27, Oleg Kalnichevski wrote:
> On Mon, 2014-09-29 at 15:59 +0200, Wolfgang Fahl wrote:
>> Hi Eric, Ioan, Oleg and others,
>>
> ...
>
>> Now I was hoping to be able to test this fix. I assume I have to add
>> some test message to:
>> core:
>>    src/test/resources/testmsgs
>>
>> But to really check the new behaviour they'd have to be three different
>> tests:
>> 1. check invalid mimeCharset in lenient mode - will work with default
>> Charset
>> 2. check invalid mimeCharset in non-lenient mode - will throw exception
>> 3. check invalid mimeCharset in non-lenient mode with overridden
>> resolveCharset - will work with chosen mapped Charset.
>>
> A plain vanilla JUnit will do.
>
>> Please let me know how I can add these tests and how to get a proper
>> patchset going. I don't work much with Subversion these days;
>> I prefer to use git.
>>
> You are welcome to open a PR at github and reference it from JIRA
>
> https://github.com/apache/james-mime4j
>
> Oleg
>
>> Cheers
>>
>> Wolfgang
>>
>> On 10.08.14 at 10:33, Stan Ioan Eugen wrote:
>>> Hello Wolfgang,
>>>
>>> Sorry for my late reply.  I've created a Jira ticket to track this
>>> issue. As Eric suggested, it's the right way to get code into the
>>> project.
>>> I've looked over the code and it looks good in general. I would keep
>>> both variants of the regular expression to match FROM lines, with  a
>>> good  javadoc, so users can use any of them in their code. I would
>>> also move the 'mbox != null' check inside the constructor - this way
>>> we make sure we don't create an object in an inconsistent state.
>>>
>>> I will be more than happy to push the patch upstream once we have some
>>> tests for the new behavior. Are you interested in providing the tests?
>>>
>>> Please use the issue for patch submission and relevant comments.
>>> https://issues.apache.org/jira/browse/MIME4J-242
>>>
>>> Thanks,
>>>
>>>
>>> 2014-08-03 10:52 GMT+03:00 Eric Charles <er...@apache.org>:
>>>> Could you open a JIRA issue on https://issues.apache.org/jira/browse/MIME4J
>>>> and upload your patch there? Thx.
>>>>
>>>> On 07/23/2014 09:57 AM, Wolfgang Fahl wrote:
>>>>> Hi Ioan Eugen,
>>>>>
>>>>> please find attached a patch.
>>>>>
>>>>> it uses the following fromline pattern:
>>>>> static final String DEFAULT = "^From \\S+.*\\d{4}$";
>>>>> so that it matches more lines.
>>>>> 1. From ieugen@apache.org Fri Sep 09 14:04:52 2011
>>>>> 2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
>>>>> 3. From - Wed Apr 02 06:51:08 2014
>>>>>
>>>>> so looking for an "@" sign is not enforced any more.
>>>>>
>>>>> The patch fixes a typo:
>>>>> -    private Matcher fromLineMathcer;
>>>>> +    private Matcher fromLineMatcher;
>>>>>
>>>>> in many places of the source code.
>>>>>
>>>>> It adds a reference to the original mbox File so that the error message:
>>>>> +                 if (mbox!=null)
>>>>> +                       path=mbox.getPath();
>>>>> +            throw new IllegalArgumentException("File "+path+" does not
>>>>> contain From_ lines that match the pattern
>>>>> '"+MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");
>>>>>
>>>>> can be improved.
>>>>>
>>>>> Who is going to check this patch and what needs to be done to get it
>>>>> into the official repo?
>>>>> I would also like to add more test cases and especially include some
>>>>> dummy mboxes. And as mentioned I'd like to check the iterator against
>>>>> all my Thunderbird mboxes to check
>>>>> whether it will successfully parse them all. Also I am offering to write
>>>>> a few "tutorial lines". Where would I have to put these?
>>>>>
>>>>> Cheers
>>>>>   Wolfgang
>>>>>
>>>>> On 22.07.14 at 22:23, Ioan Eugen Stan wrote:
>>>>>> Hello Wolfgang,
>>>>>>
>>>>>> I developed MboxIterator. It's nice to see that it's helpful :)
>>>>>>
>>>>>> You get that error because MboxIterator does not know how to split the
>>>>>> messages. Messages in an mbox file are separated by lines that start
>>>>>> with 'From ' (the word From followed by a space). They are called
>>>>>> (by me at least) 'From lines' :) .
>>>>>> One problem with the mbox format is that it's a bit 'free-form' in the
>>>>>> sense that developers abused it and we have some variants [1].
>>>>>>
>>>>>> One thing that you could try is to supply a different From line
>>>>>> regular expression to MboxIterator via the regexpPattern argument. It will
>>>>>> split messages based on this new value.
>>>>>>
>>>>>> [1] http://wiki2.dovecot.org/MailboxFormat/mbox
>>>>>>
>>>>>> Good luck and please post your results.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>
>
>
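
The From-line pattern proposed in the quoted patch message can be tried out
with plain java.util.regex, without mime4j. The sketch below is only an
illustration: the class FromLineDemo and its methods are hypothetical, it
assumes "\n" line endings, and it does not reproduce how MboxIterator
actually splits a file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; it does not use or reproduce MboxIterator.
final class FromLineDemo {

    // Pattern string as quoted in the patch message.
    // MULTILINE makes ^ and $ match at line boundaries.
    static final Pattern FROM_LINE =
            Pattern.compile("^From \\S+.*\\d{4}$", Pattern.MULTILINE);

    // True if the given single line is accepted as a From_ line.
    static boolean isFromLine(String line) {
        return FROM_LINE.matcher(line).matches();
    }

    // Splits mbox text into messages; each message starts after its From_ line.
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<String>();
        Matcher m = FROM_LINE.matcher(mbox);
        int bodyStart = -1;
        while (m.find()) {
            if (bodyStart >= 0) {
                messages.add(mbox.substring(bodyStart, m.start()));
            }
            // Skip the "\n" that ends the From_ line (assumed line ending).
            bodyStart = Math.min(m.end() + 1, mbox.length());
        }
        if (bodyStart >= 0) {
            messages.add(mbox.substring(bodyStart));
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From - Wed Apr 02 06:51:08 2014\n"
                + "Subject: a\n\nfirst\n"
                + "From MAILER-DAEMON Wed Oct 05 21:54:09 2011\n"
                + "Subject: b\n\nsecond\n";
        System.out.println(split(mbox).size() + " messages found");
    }
}
```

Because the "@" sign is no longer required, the pattern is fairly loose: a
body line such as "From now on we plan for 2015" is accepted as well. Many
mbox writers escape such body lines as ">From ", but that should be checked
against the real mailboxes.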




Re: Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Oleg Kalnichevski <ol...@apache.org>.
On Mon, 2014-09-29 at 15:59 +0200, Wolfgang Fahl wrote:
> Hi Eric, Ioan, Oleg and others,
> 

...

> Now I was hoping to be able to test this fix. I assume I have to add
> some test message to:
> core:
>    src/test/resources/testmsgs
> 
> But to really check the new behaviour they'd have to be three different
> tests:
> 1. check invalid mimeCharset in lenient mode - will work with default
> Charset
> 2. check invalid mimeCharset in non-lenient mode - will throw exception
> 3. check invalid mimeCharset in non-lenient mode with overridden
> resolveCharset - will work with chosen mapped Charset.
> 

A plain vanilla JUnit will do.

> Please let me know how I can add these tests and how to get a proper
> patchset going. I don't work much with Subversion these days;
> I prefer to use git.
> 

You are welcome to open a PR at github and reference it from JIRA

https://github.com/apache/james-mime4j

Oleg




Mime4J improvements was: Re: Thunderbird Mailbox support (patch included)

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Eric, Ioan, Oleg and others,

as offered in July:
> I would also like to add more test cases and especially include some
> dummy mboxes. And as mentioned I'd like to check the iterator against
> all my Thunderbird mboxes to check
> whether it will successfully parse them all. 
I started doing this based on the improvements that you kindly checked
in in the meantime.
So I am working with 0.8.0-SNAPSHOT at this time.

I intend to run the iterator against some 1/4 million emails in some 850
mailboxes. With 0.7.2 I got as far
as about message 400. With 0.8.0-SNAPSHOT the library chokes
at about message 4000,
which is from the Apple Store!

it contains:

<22...@apple.com>
Content-Type: TEXT/HTML; CHARSET=None
Content-Transfer-Encoding: QUOTED-PRINTABLE


And I ran into bug
https://issues.apache.org/jira/browse/MIME4J-218

I tried:
    /**
     * Lenient BodyFactory that works around the "won't fix" behaviour of
     * https://issues.apache.org/jira/browse/MIME4J-218
     *
     * @author wf
     */
    public static class LenientBodyFactory extends BasicBodyFactory {

        @Override
        public Charset resolveCharset(final String mimeCharset)
                throws UnsupportedEncodingException {
            Charset result=Charset.defaultCharset();
            try {
                result=super.resolveCharset(mimeCharset);
            } catch (UnsupportedEncodingException ex) {
                // ignore: keep the default charset
            }
            return result;
        }
    }

That didn't work, since resolveCharset is private static and so cannot be
overridden ... :-(

I proposed the following fix for
dom/src/main/java/org/apache/james/mime4j/message/BasicBodyFactory.java:

    public static boolean lenient=true;

    /**
     * Select the Charset for the given mimeCharset string.
     *
     * If you need support for non-standard or invalid mimeCharset
     * specifications you might want to create your own derived BodyFactory
     * extending BasicBodyFactory and overriding this method as suggested by:
     *   https://issues.apache.org/jira/browse/MIME4J-218
     *
     * The default behaviour is lenient: invalid mimeCharset specs will
     * return the defaultCharset.
     *
     * @param mimeCharset the string specification for a charset, e.g. "UTF-8"
     * @throws UnsupportedEncodingException if the mimeCharset is invalid
     */
    protected Charset resolveCharset(final String mimeCharset)
            throws UnsupportedEncodingException {
        Charset result=null;
        if (lenient) {
            result=Charset.defaultCharset();
        }
        if (mimeCharset!=null) {
            try {
                result=Charset.forName(mimeCharset);
            } catch (UnsupportedCharsetException ex) {
                if (!lenient)
                    throw new UnsupportedEncodingException(mimeCharset);
            }
        }
        return result;
    }

Now I was hoping to be able to test this fix. I assume I have to add
some test message to:
core:
   src/test/resources/testmsgs

But to really check the new behaviour they'd have to be three different
tests:
1. check invalid mimeCharset in lenient mode - will work with default
Charset
2. check invalid mimeCharset in non-lenient mode - will throw exception
3. check invalid mimeCharset in non-lenient mode with overridden
resolveCharset - will work with chosen mapped Charset.
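
The three cases could be exercised roughly like this. It is only a
standalone sketch and does not use the mime4j test infrastructure:
CharsetResolver and MappingResolver are hypothetical stand-ins for
BasicBodyFactory and a derived factory, the lenient flag is a constructor
argument rather than a static field, and unlike the proposed patch the
sketch also catches IllegalCharsetNameException and treats a missing
mimeCharset as "use the default".

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical stand-in for BasicBodyFactory; not part of mime4j.
class CharsetResolver {
    private final boolean lenient;

    CharsetResolver(boolean lenient) {
        this.lenient = lenient;
    }

    // Mirrors the idea of the proposed resolveCharset.
    protected Charset resolveCharset(String mimeCharset)
            throws UnsupportedEncodingException {
        if (mimeCharset == null) {
            // Assumption of this sketch: no charset given means default.
            return Charset.defaultCharset();
        }
        try {
            return Charset.forName(mimeCharset);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException ex) {
            if (!lenient) {
                throw new UnsupportedEncodingException(mimeCharset);
            }
            return Charset.defaultCharset();
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. lenient mode: the invalid name falls back to the default charset
        System.out.println(new CharsetResolver(true).resolveCharset("None"));

        // 2. non-lenient mode: the invalid name is rejected
        try {
            new CharsetResolver(false).resolveCharset("None");
        } catch (UnsupportedEncodingException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }

        // 3. non-lenient mode with an overriding subclass: mapped charset
        System.out.println(new MappingResolver().resolveCharset("None"));
    }
}

// Hypothetical derived resolver that maps the bogus name to a chosen charset.
class MappingResolver extends CharsetResolver {
    MappingResolver() {
        super(false);
    }

    @Override
    protected Charset resolveCharset(String mimeCharset)
            throws UnsupportedEncodingException {
        if ("None".equalsIgnoreCase(mimeCharset)) {
            return StandardCharsets.ISO_8859_1; // illustrative mapping only
        }
        return super.resolveCharset(mimeCharset);
    }
}
```

In a real patch each case would presumably become its own unit test and
be driven by a test message like the Apple Store one (CHARSET=None).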

Please let me know how I can add these tests and how to get a proper
patch set going. I don't work much with Subversion these days -
I prefer to use git.

Cheers

Wolfgang


-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 


Re: Thunderbird Mailbox support (patch included)

Posted by Stan Ioan Eugen <me...@ieugen.ro>.
Hello Wolfgang,

Sorry for my late reply. I've created a Jira ticket to track this
issue. As Eric suggested, it's the right way to get code into the
project.
I've looked over the code and it looks good in general. I would keep
both variants of the regular expression to match From lines, with
good javadoc, so users can use either of them in their code. I would
also move the 'mbox != null' check inside the constructor - this way
we make sure we don't create an object in an inconsistent state.

I will be more than happy to push the patch upstream once we have some
tests for the new behavior. Are you interested in providing the tests?

Please use the issue for patch submission and relevant comments.
https://issues.apache.org/jira/browse/MIME4J-242

Thanks,





-- 
Ioan Eugen Stan / ieugen.ro

Re: Thunderbird Mailbox support (patch included)

Posted by Eric Charles <er...@apache.org>.
Could you open a JIRA issue at https://issues.apache.org/jira/browse/MIME4J
and upload your patch there? Thx.


Re: Thunderbird Mailbox support (patch included)

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Ioan Eugen,

Please find attached a patch.

It uses the following From line pattern:
static final String DEFAULT = "^From \\S+.*\\d{4}$";
so that it matches more lines, for example:
1. From ieugen@apache.org Fri Sep 09 14:04:52 2011
2. From MAILER-DAEMON Wed Oct 05 21:54:09 2011
3. From - Wed Apr 02 06:51:08 2014

An "@" sign is therefore no longer required.
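
The pattern can be checked against the three sample lines with a few
lines of plain Java. This is only a standalone sketch: FromLineCheck and
isFromLine are hypothetical names, not part of the patch, and the fourth
test line is an invented example of an ordinary header.

```java
import java.util.regex.Pattern;

// Standalone check of the relaxed From line pattern quoted above.
class FromLineCheck {
    // Same regular expression as the DEFAULT constant in the patch.
    static final Pattern RELAXED = Pattern.compile("^From \\S+.*\\d{4}$");

    // True if the whole line looks like an mbox message separator.
    static boolean isFromLine(String line) {
        return RELAXED.matcher(line).matches();
    }

    public static void main(String[] args) {
        String[] lines = {
            "From ieugen@apache.org Fri Sep 09 14:04:52 2011",
            "From MAILER-DAEMON Wed Oct 05 21:54:09 2011",
            "From - Wed Apr 02 06:51:08 2014",
            "From: someone@example.org", // ordinary header, not a separator
        };
        for (String line : lines) {
            System.out.println(isFromLine(line) + "  " + line);
        }
    }
}
```

One consequence of relaxing the pattern: an unescaped body line such as
"From the archive, restored in 2014" would match as well, so the pattern
relies on the mailbox writer quoting such lines (typically as ">From ").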

The patch fixes a typo:
-    private Matcher fromLineMathcer;
+    private Matcher fromLineMatcher;

in many places of the source code.

It adds a reference to the original mbox File so that the error message
can be improved:
+            if (mbox!=null)
+                path=mbox.getPath();
+            throw new IllegalArgumentException("File "+path+" does not contain From_ lines that match the pattern '"
+                +MESSAGE_START.pattern()+"'! Maybe not be a valid Mbox.");

Who is going to check this patch and what needs to be done to get it
into the official repo?
I would also like to add more test cases and especially include some
dummy mboxes. And as mentioned I'd like to check the iterator against
all my Thunderbird mboxes to check
whether it will successfully parse them all. Also I am offering to write
a few "tutorial lines". Where would I have to put these?

Cheers
  Wolfgang


-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 


Re: Thunderbird Mailbox support

Posted by Ioan Eugen Stan <st...@gmail.com>.
Hello Wolfgang,

I developed MboxIterator. It's nice to see that it's helpful :)

You get that error because MboxIterator does not know how to split the
messages. Messages in an mbox file are separated by lines that start
with 'From ' (the word From followed by a space). They are called (by me
at least) 'From lines' :) .
One problem with the mbox format is that it's a bit 'free-form' in the
sense that developers abused it and we have some variants [1].

One thing that you could try is to supply a different From line
regular expression to MboxIterator via the regexpPattern argument. It will
split messages based on this new value.
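
To illustrate why the From line expression decides whether a mailbox can
be split at all, here is a small standalone sketch in plain Java. It is
not the MboxIterator implementation and does not use its API:
MboxSplitSketch and split are hypothetical names, and the stricter
pattern requiring an "@" is an assumption made only for the comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits mbox text at lines matching a From line regex.
class MboxSplitSketch {

    static List<String> split(String mbox, String fromLineRegex) {
        // MULTILINE lets ^ and $ match at every line boundary.
        Matcher m = Pattern.compile(fromLineRegex, Pattern.MULTILINE).matcher(mbox);
        List<Integer> starts = new ArrayList<>();
        while (m.find()) {
            starts.add(m.start());
        }
        if (starts.isEmpty()) {
            throw new IllegalArgumentException(
                    "No From_ lines match the pattern '" + fromLineRegex + "'");
        }
        // Each message runs from its From line up to the next From line.
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < starts.size(); i++) {
            int end = i + 1 < starts.size() ? starts.get(i + 1) : mbox.length();
            messages.add(mbox.substring(starts.get(i), end));
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From - Wed Apr 02 06:51:08 2014\n"
                + "Subject: one\n\nbody one\n\n"
                + "From MAILER-DAEMON Wed Oct 05 21:54:09 2011\n"
                + "Subject: two\n\nbody two\n";

        // A relaxed pattern finds both Thunderbird-style separators.
        System.out.println(split(mbox, "^From \\S+.*\\d{4}$").size());

        // A pattern that insists on an "@" finds none and fails.
        try {
            split(mbox, "^From \\S+@\\S.*\\d{4}$");
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```

Thunderbird writes separators of the form "From - <date>", so any
expression that expects a sender address will find no message
boundaries in such a file.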

[1] http://wiki2.dovecot.org/MailboxFormat/mbox

Good luck and please post your results.

Regards,




-- 
Ioan Eugen Stan
0720 898 747