Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2009/09/29 00:17:16 UTC

[jira] Created: (TIKA-295) Rough cut of mbox parser

Rough cut of mbox parser
------------------------

                 Key: TIKA-295
                 URL: https://issues.apache.org/jira/browse/TIKA-295
             Project: Tika
          Issue Type: New Feature
    Affects Versions: 0.4
            Reporter: Ken Krugler


Attached is a patch for a first cut of a parser that handles mailbox (.mbox, application/mbox) files.

* The first email headers are used to fill in metadata. Subsequent email headers are tossed.
* Charset handling needs to be fixed up. It's unclear (not spec'd) whether each email uses the charset specified in its own headers, or whether the entire file should be re-encoded (with the encoding either sent in the response header or auto-detected).
* Multi-part emails won't be handled properly, though it's unclear what should be done in that case (if anything).
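The behaviour described in the bullets above can be sketched roughly as follows. This is not the code from tika-295.patch: the class and method names (MboxSketch, Result, parse) are hypothetical, the input is taken as a String rather than a stream, and it assumes that any line starting with "From " opens a new message, which is a simplification of what real mbox variants require.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the described behaviour; not the attached patch. */
final class MboxSketch {

    /** Headers of the first message, plus the body text of all messages. */
    static final class Result {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final StringBuilder text = new StringBuilder();
        int messageCount;
    }

    static Result parse(String mbox) {
        Result result = new Result();
        boolean inHeaders = false;
        String lastHeader = null; // header that a folded line would continue
        for (String line : mbox.split("\\r?\\n")) {
            if (line.startsWith("From ")) {
                // Assumption: every "From " line is a message separator.
                // Body lines that look like this are normally escaped by the writer.
                result.messageCount++;
                inHeaders = true;
                lastHeader = null;
                continue;
            }
            if (result.messageCount == 0) {
                continue; // ignore anything before the first separator
            }
            if (!inHeaders) {
                result.text.append(line).append('\n');
            } else if (line.isEmpty()) {
                inHeaders = false; // a blank line ends the header block
            } else if (result.messageCount == 1) {
                lastHeader = recordHeader(result.metadata, lastHeader, line);
            }
            // Headers of the second and later messages are simply dropped.
        }
        return result;
    }

    /** Stores "Name: value", or appends a folded continuation line. */
    private static String recordHeader(Map<String, String> metadata,
                                       String lastHeader, String line) {
        boolean folded = line.startsWith(" ") || line.startsWith("\t");
        if (folded && lastHeader != null) {
            metadata.put(lastHeader, metadata.get(lastHeader) + " " + line.trim());
            return lastHeader;
        }
        int colon = line.indexOf(':');
        if (colon <= 0) {
            return lastHeader; // not a header line; ignored in this sketch
        }
        String name = line.substring(0, colon).trim();
        metadata.put(name, line.substring(colon + 1).trim());
        return name;
    }
}
```

Calling MboxSketch.parse(...) on a two-message mailbox would leave only the first message's headers in metadata while text holds both bodies. Charset handling and multi-part messages are left out on purpose, matching the open questions listed above.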


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Thilo Goetz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765485#action_12765485 ] 

Thilo Goetz commented on TIKA-295:
----------------------------------

I have used mstor in the past, which is under a BSD license and worked well for me.  It drags in a whole boatload of dependencies (and I didn't check all the licenses), but I suspect that just for MBOX parsing you won't need most of them.  It might be worth checking out mstor before writing our own mbox parser.


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Alex Baranov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765410#action_12765410 ] 

Alex Baranov commented on TIKA-295:
-----------------------------------

I guess since Tika is a subproject of Lucene, you should use the same format as the other Lucene projects:

http://wiki.apache.org/lucene-java/HowToContribute
http://wiki.apache.org/solr/HowToContribute
(at the end of the pages).

One question about the parser - are you still working on it? Any progress since the first draft?


[jira] Resolved: (TIKA-295) Rough cut of mbox parser

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-295.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Jukka Zitting

Nice work, thanks! I committed the patch (with tabs->spaces changes and an added license header for the test case) in revision 820967.

For further work on this I would suggest using the Mime4J library [1] from Apache James, as they've already dealt with many of the questions you raise above.

I'm resolving this as Fixed as the basic feature is now there thanks to the patch. Please file additional issues on any future improvements.

[1] http://james.apache.org/mime4j/


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760414#action_12760414 ] 

Ken Krugler commented on TIKA-295:
----------------------------------

This patch also relies on using Mockito for unit tests, so there's a modified pom.xml that adds this as a dependency.

I'm hoping it's OK to add Mockito to the test scope.


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765579#action_12765579 ] 

Ken Krugler commented on TIKA-295:
----------------------------------

Hi Alex - thanks for looking into the formatting issues. Maybe I should open a Jira issue to create an Eclipse formatter file :)

Re additional work done on this parser - nothing more yet, it's working for what I currently need, sorry.


[jira] Updated: (TIKA-295) Rough cut of mbox parser

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-295:
-----------------------------

    Attachment: tika-295.patch


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765578#action_12765578 ] 

Ken Krugler commented on TIKA-295:
----------------------------------

Hi Thilo - I also looked at mstor, but trying to figure out the license issues and JavaMail dependencies gave me a headache.

And the mbox format itself is trivial - the hard part is properly parsing the mail messages themselves, which is where (I think) mime4j would be a good option.

But if there aren't any license issues, and it's easy to separate mstor, then I agree that's a good candidate.
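As one small illustration of the per-message parsing mentioned above, here is a sketch of the "each email uses its own charset" option from the issue description. It is not taken from the patch, from mstor, or from mime4j: MessageCharsetSketch, charsetOf and decodeBody are hypothetical names, the regular expression is a simplification of the MIME parameter grammar, and falling back to US-ASCII when no charset is given is an assumption based on the MIME default for text/plain. Content-Transfer-Encoding is ignored entirely.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: pick a charset per message from its Content-Type. */
final class MessageCharsetSketch {

    // Simplified: matches charset=value or charset="value", case-insensitively.
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: use the fallback instead.
            return fallback;
        }
    }

    /** Decodes one message body using that message's own charset. */
    static String decodeBody(byte[] rawBody, String contentType) {
        return new String(rawBody, charsetOf(contentType, StandardCharsets.US_ASCII));
    }
}
```

The alternative raised in the issue, treating the whole file as one encoding, would instead pick a single Charset up front and ignore the per-message headers. A MIME library such as mime4j would be expected to cover this and much more (encoded words, transfer encodings, multi-part bodies), which this sketch does not attempt.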


[jira] Issue Comment Edited: (TIKA-295) Rough cut of mbox parser

Posted by "Alex Baranov (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765410#action_12765410 ] 

Alex Baranov edited comment on TIKA-295 at 10/13/09 11:12 PM:
--------------------------------------------------------------

I guess since Tika is a subproject of Lucene, you should use the same format as the other Lucene projects:

http://wiki.apache.org/lucene-java/HowToContribute
http://wiki.apache.org/solr/HowToContribute
(at the end of the pages).

[Edited: well, it turned out that the Tika project uses a different coding style. At least the indent is 4 spaces instead of 2...]

One question about the parser - are you still working on it? Any progress since the first draft?


[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764475#action_12764475 ] 

Ken Krugler commented on TIKA-295:
----------------------------------

Hi Jukka,

Is there an Eclipse formatter file that defines the Tika project's target format?

Thanks,

-- Ken
