Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2009/09/29 00:17:16 UTC

[jira] Created: (TIKA-295) Rough cut of mbox parser

Rough cut of mbox parser
------------------------

                 Key: TIKA-295
                 URL: https://issues.apache.org/jira/browse/TIKA-295
             Project: Tika
          Issue Type: New Feature
    Affects Versions: 0.4
            Reporter: Ken Krugler


Attached is a patch for a first cut of a parser that handles mailbox (.mbox, application/mbox) files.

* The first email headers are used to fill in metadata. Subsequent email headers are tossed.
* Charset handling needs to be fixed up. It's unclear (not spec'd) whether emails individually use the charset as specified in their individual header, or the entire file should be re-encoded (and the encoding is sent in the response header, or auto-detected).
* Multi-part emails won't be handled properly, though it's unclear what should be done in that case (if anything).
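
For readers without the patch at hand, the first bullet can be illustrated with a minimal sketch. This is not the code from the attached patch: the class name MboxSketch, its Result holder, and the scan/scanString methods are hypothetical names made up for illustration. It assumes that each message starts with a "From " separator line, that body lines beginning with "From " have already been escaped by the mailbox writer, and it ignores folded (continuation) header lines as well as charsets and multi-part bodies.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: keep the first message's headers, toss the rest. */
public class MboxSketch {

    /** Headers of the first message plus the number of messages seen. */
    public static final class Result {
        public final Map<String, String> firstHeaders = new LinkedHashMap<>();
        public int messageCount = 0;
    }

    /** Scans mbox text line by line; assumes "From " only starts a new message. */
    public static Result scan(BufferedReader reader) throws IOException {
        Result result = new Result();
        boolean inHeaders = false;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("From ")) {
                // Separator line: a new message begins, headers follow.
                result.messageCount++;
                inHeaders = true;
                continue;
            }
            if (!inHeaders) {
                continue; // body text, or junk before the first separator
            }
            if (line.isEmpty()) {
                inHeaders = false; // blank line ends the header block
                continue;
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                continue; // folded header continuation, ignored in this sketch
            }
            if (result.messageCount == 1) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    // Only the first occurrence of each header name is kept.
                    result.firstHeaders.putIfAbsent(
                            line.substring(0, colon).trim(),
                            line.substring(colon + 1).trim());
                }
            }
        }
        return result;
    }

    /** Convenience wrapper for in-memory text. */
    public static Result scanString(String text) {
        try {
            return scan(new BufferedReader(new StringReader(text)));
        } catch (IOException e) {
            // A StringReader does not fail on read; rethrow just in case.
            throw new IllegalStateException(e);
        }
    }
}
```

A real parser would hand the collected headers to Tika's metadata object and emit the message bodies as text. How to decode those bodies is exactly the open charset question in the second bullet, so the sketch leaves that to whoever constructs the reader.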


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Thilo Goetz (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765485#action_12765485 ] 

Thilo Goetz commented on TIKA-295:
----------------------------------

I have used mstor in the past, which is under a BSD license and worked well for me.  It drags in a whole boatload of dependencies (and I didn't check all the licenses), but I suspect that just for MBOX parsing you won't need most of them.  It might be worth checking out mstor before writing our own mbox parser.


> Rough cut of mbox parser
> ------------------------
>
>                 Key: TIKA-295
>                 URL: https://issues.apache.org/jira/browse/TIKA-295
>             Project: Tika
>          Issue Type: New Feature
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>             Fix For: 0.5
>
>         Attachments: tika-295.patch
>
>
> Attached is a patch for a first-cut at a parser that handles mailbox (.mbox, application/mbox) files.
> * The first email headers are used to fill in metadata. Subsequent email headers are tossed.
> * Charset handling needs to be fixed up. It's unclear (not spec'd) whether emails individually use the charset as specified in their individual header, or the entire file should be re-encoded (and the encoding is sent in the response header, or auto-detected).
> * Multi-part emails won't be handled properly, though it's unclear what should be done in that case (if anything).

[jira] Commented: (TIKA-295) Rough cut of mbox parser

Posted by "Alex Baranov (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765410#action_12765410 ] 

Alex Baranov commented on TIKA-295:
-----------------------------------

I guess that since Tika is a subproject of Lucene, you should use the same format as the other Lucene projects:

http://wiki.apache.org/lucene-java/HowToContribute
http://wiki.apache.org/solr/HowToContribute
(at the end of the pages).

One question about the parser: are you still working on it? Has there been any progress since the first draft?
