Posted to dev@tika.apache.org by "Mark Butler (JIRA)" <ji...@apache.org> on 2011/05/27 12:37:47 UTC

[jira] [Created] (TIKA-667) Changes to RFC822Parser to support turning off strict parsing

Changes to RFC822Parser to support turning off strict parsing
-------------------------------------------------------------

                 Key: TIKA-667
                 URL: https://issues.apache.org/jira/browse/TIKA-667
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.0
            Reporter: Mark Butler
            Priority: Minor
             Fix For: 1.0
         Attachments: mailparser.diff

Currently, if Apache Mime4J fails while parsing any field in RFC822Parser, parsing of the whole document fails. This causes problems on the Enron Corpus; see https://issues.apache.org/jira/browse/TIKA-657

RFC822Parser is configured from a MimeEntityConfig object, which contains an option for "strict parsing". Currently MailContentHandler only performs strict parsing, i.e. if a MimeException is encountered while processing any field in MailContentHandler.field(), processing of the document fails. However, we may prefer non-strict parsing, i.e. to continue even if processing one or more fields fails. This can be achieved by placing a try / catch block around the logic inside MailContentHandler.field() and rethrowing the error only if strictParsing is enabled; otherwise we log the error and carry on.
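
The try / catch approach described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the contents of the attached diff: the class LenientFieldDemo, the Handler class, its parseField() helper and FieldParseException are hypothetical stand-ins (for MailContentHandler and Mime4J's MimeException respectively), and the "empty body is an error" rule is invented purely to trigger a failure.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LenientFieldDemo {

    /** Hypothetical stand-in for Mime4J's MimeException. */
    static class FieldParseException extends Exception {
        FieldParseException(String message) {
            super(message);
        }
    }

    /** Hypothetical, much simplified analogue of MailContentHandler. */
    static class Handler {
        private final boolean strictParsing;
        private final Map<String, String> metadata = new LinkedHashMap<>();

        Handler(boolean strictParsing) {
            this.strictParsing = strictParsing;
        }

        /** Called once per header field; mirrors the idea of field(). */
        void field(String name, String body) throws FieldParseException {
            try {
                metadata.put(name, parseField(body));
            } catch (FieldParseException e) {
                if (strictParsing) {
                    // Strict mode: one bad field fails the whole document.
                    throw e;
                }
                // Lenient mode: report the problem and skip this field only.
                System.err.println("Skipping field " + name + ": " + e.getMessage());
            }
        }

        /** Invented parsing rule: a blank body counts as a parse failure. */
        private String parseField(String body) throws FieldParseException {
            if (body == null || body.trim().isEmpty()) {
                throw new FieldParseException("empty field body");
            }
            return body.trim();
        }

        Map<String, String> getMetadata() {
            return metadata;
        }
    }

    public static void main(String[] args) throws FieldParseException {
        Handler lenient = new Handler(false);
        lenient.field("From", " alice@example.com ");
        lenient.field("To", "   ");          // fails to parse, but is skipped
        lenient.field("Subject", "Hello");
        System.out.println(lenient.getMetadata());
    }
}
```

With strictParsing set to true, the same blank To: field would instead propagate the exception out of field() and abort processing, which matches the current behaviour.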

I enclose a diff for RFC822Parser and MailContentHandler that does this. I have also made some other minor changes to MailContentHandler: the repeated code for handling the To:, Cc: and Bcc: fields is replaced with a single private method, and stripOutFieldPrefix is rewritten to avoid manipulating the String through repeated re-assignment.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-667) Changes to RFC822Parser to support turning off strict parsing

Posted by "Mark Butler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Butler updated TIKA-667:
-----------------------------

    Attachment: mailparser.diff

Diff for RFC822Parser.java and MailContentHandler.java

[jira] [Resolved] (TIKA-667) Changes to RFC822Parser to support turning off strict parsing

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-667.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks! Patch committed in revision 1160018.

Note that I removed the log message that was emitted when a problem with a header field is encountered. In such a situation I think it's fine to just silently ignore that field, just as Mime4J silently skips parse issues when strict parsing is not enabled.

PS. I also changed some tab indentation to spaces.
