Posted to dev@tika.apache.org by "Jens Wilmer (JIRA)" <ji...@apache.org> on 2011/04/15 23:40:05 UTC

[jira] [Created] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)
-----------------------------------------------------------------------------------------------------------------------------------

                 Key: TIKA-640
                 URL: https://issues.apache.org/jira/browse/TIKA-640
             Project: Tika
          Issue Type: Wish
          Components: parser
    Affects Versions: 0.9
         Environment: All
            Reporter: Jens Wilmer


The standard configuration of Mime4j accepts only 1000 characters per line and 1000 characters per header. Tika's streaming approach should not need these limitations. As it stands, an exception is thrown and none of the data read is available.

Solution:
Replace all occurrences of:

Parser parser = new RFC822Parser();

by:

MimeEntityConfig config = new MimeEntityConfig();
config.setMaxLineLen(-1);
config.setMaxContentLen(-1);
Parser parser = new RFC822Parser(config);


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109066#comment-13109066 ] 

Jukka Zitting commented on TIKA-640:
------------------------------------

Note that along with TIKA-716 and Mime4J version 0.7 the configuration object is now called MimeConfig instead of MimeEntityConfig.

> RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-640
>                 URL: https://issues.apache.org/jira/browse/TIKA-640
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 0.9
>         Environment: All
>            Reporter: Jens Wilmer
>            Assignee: Jukka Zitting
>              Labels: mail, rfc822parser
>             Fix For: 0.10
>
>         Attachments: TIKA-640.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m

[jira] [Commented] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Jens Wilmer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020946#comment-13020946 ] 

Jens Wilmer commented on TIKA-640:
----------------------------------

For how long does it take up too much heap space, what counts as too much heap space, and what is the problem with taking up too much heap space? Is a possible OutOfMemoryError the problem? I would rather fail to read any information and catch an OutOfMemoryError when I have to process an email with larger headers than I can handle, than fail to read any information and catch an IOException caused by a MaxLineLimitException when I have to handle a message containing a header bigger than some arbitrarily chosen size, a size that must be smaller than what is actually possible for the limit to take any effect. Once either of these exceptions has been thrown and caught, there is no real difference in the program's flow. And whatever the limit is, you still have to handle both exceptions, because "too much heap space" depends heavily on how much heap space is available, which in turn depends on many parameters and changes over time.


[jira] [Commented] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022793#comment-13022793 ] 

Benjamin Douglas commented on TIKA-640:
---------------------------------------

I believe the real difference between these two scenarios is that the IOException definitely comes from the current document being read, while an OutOfMemoryError might be caused by any number of other things, many of which are not recoverable. Catching unchecked OutOfMemoryErrors, skipping the file, and moving on does not seem like something we want our normal program flow to look like. Catching checked IOExceptions, skipping the file, and moving on seems more reasonable.

That said, I will concede that having an email with a header so large that it noticeably imposes on someone's heap should be quite rare. And, because of the way that Tika handles metadata, it all needs to end up in memory anyway -- there is no streaming of metadata, it is represented as a bag of Strings. The hard-coded limit I think has some value as a stopgap in cases where a bogus file gets mis-detected, but that is a general problem not limited to RFC822 messages. It certainly could be decided to err on the side of letting all documents through, even bogus ones with potential memory problems, instead of being too conservative and not letting some valid documents through.
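
As a rough illustration of the recovery pattern discussed in this comment, here is a minimal, self-contained Java sketch. It does not use Tika or Mime4j; the names SkipOnIOException, DocumentParser and parseAll are hypothetical stand-ins for a batch job that calls a parser once per document. The point it tries to show is that a checked IOException can be attributed to one document and handled by skipping it, while OutOfMemoryError is deliberately left uncaught.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SkipOnIOException {

    /** Hypothetical stand-in for a per-document parse call. */
    interface DocumentParser {
        String parse(String document) throws IOException;
    }

    /**
     * Parses every document in order. A checked IOException is attributed to
     * the document being parsed, so that document is recorded in 'skipped'
     * and the loop continues with the next one.
     */
    static List<String> parseAll(List<String> documents,
                                 DocumentParser parser,
                                 List<String> skipped) {
        List<String> results = new ArrayList<>();
        for (String document : documents) {
            try {
                results.add(parser.parse(document));
            } catch (IOException e) {
                skipped.add(document);
            }
            // OutOfMemoryError is intentionally not caught here: it may have
            // been caused by something unrelated to this document.
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumption for the demo: inputs longer than 10 characters are rejected.
        DocumentParser parser = document -> {
            if (document.length() > 10) {
                throw new IOException("header too long");
            }
            return document.toUpperCase();
        };
        List<String> skipped = new ArrayList<>();
        List<String> parsed = parseAll(
                Arrays.asList("short", "a very long document", "ok"), parser, skipped);
        System.out.println("parsed=" + parsed + " skipped=" + skipped);
    }
}
```

With a limit in place, an oversized header costs one skipped document. Without a limit, the same input either parses fine or, in the worst case, fails in a way the loop cannot attribute to any particular document.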

[jira] [Updated] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Douglas updated TIKA-640:
----------------------------------

    Attachment: TIKA-640.patch

I'll concede that, given that the Metadata structure holds entire fields in strings, emails should behave no differently. This patch sets the maximum field length to unlimited, which should not be a problem in all but the most unusual of circumstances. Setting MaxContentLength to unlimited, as suggested by the JIRA author, is not necessary as that is the default.

[jira] [Resolved] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-640.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0
         Assignee: Jukka Zitting

[jira] [Commented] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Benjamin Douglas (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13020622#comment-13020622 ] 

Benjamin Douglas commented on TIKA-640:
---------------------------------------

Per TIKA-461, a patch was recently committed to trunk to raise the limit to 10,000 characters, as 1,000 was too restrictive. The problem with setting it to unlimited (-1, as you show in the example) is that, because of the nature of mime4j, all of the header data is read into a single String. The RFC does not put any limit on how many characters can go into a header, so this could potentially be very large. As far as I understand the goals of the Tika library, it should handle arbitrarily large files and thus uses a streaming model. Since headers cannot be streamed with mime4j, some artificial limit must be set to avoid taking up too much heap space.
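
To make the trade-off above concrete, here is a minimal, self-contained Java sketch of a reader that buffers one whole line in memory and enforces a maximum length, treating a negative limit as "unlimited". It is not Mime4j's implementation; BoundedLineReader, LineTooLongException and readLine are hypothetical names, and line endings are simplified to a bare '\n'.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class BoundedLineReader {

    /** Checked exception raised when a line exceeds the configured limit. */
    static final class LineTooLongException extends IOException {
        LineTooLongException(String message) {
            super(message);
        }
    }

    /**
     * Reads one line (terminated by '\n' or end of input) fully into memory.
     * A negative maxLineLen disables the check, so the buffer can grow as
     * large as the input allows. Returns null at end of input.
     */
    static String readLine(Reader in, int maxLineLen) throws IOException {
        StringBuilder line = new StringBuilder();
        boolean sawAnything = false;
        int c;
        while ((c = in.read()) != -1) {
            sawAnything = true;
            if (c == '\n') {
                break;
            }
            if (maxLineLen >= 0 && line.length() >= maxLineLen) {
                throw new LineTooLongException(
                        "Line exceeds " + maxLineLen + " characters");
            }
            line.append((char) c);
        }
        return sawAnything ? line.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // Build a header line of roughly 20,000 characters.
        StringBuilder header = new StringBuilder("Subject: ");
        for (int i = 0; i < 20000; i++) {
            header.append('x');
        }
        String input = header + "\n";

        // Unlimited: the whole line ends up in one String.
        System.out.println(readLine(new StringReader(input), -1).length());

        // Limited: the same input is rejected with a checked exception.
        try {
            readLine(new StringReader(input), 10000);
        } catch (LineTooLongException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The limit check is the only thing standing between an arbitrarily long input line and an arbitrarily large buffer, which is why removing it shifts the failure mode from a checked exception to possible heap exhaustion.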

[jira] [Commented] (TIKA-640) RFC822Parser should configure Mime4j not to fail reading mails containing more than 1000 chars in one headers text (even if folded)

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034933#comment-13034933 ] 

Jukka Zitting commented on TIKA-640:
------------------------------------

This seems like a rather rare use case, so I'd rather make this configurable instead of changing the default behavior. In the default configuration it's far better to avoid a possible OOM at the cost of not being able to parse some very rare or malformed emails.

In revision 1104444 I added support for passing a custom MimeEntityConfig object through the parse context. This way you can achieve your use case by running the following code snippet before you pass the ParseContext object to the parser.

    MimeEntityConfig config = new MimeEntityConfig();
    config.setMaxLineLen(-1);
    context.set(MimeEntityConfig.class, config);
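
For readers unfamiliar with this lookup style, the following self-contained Java sketch shows the general pattern under stated assumptions: a context keyed by class, and a consumer that falls back to a default configuration when nothing was supplied. SimpleContext, LineLimitConfig and effectiveMaxLineLen are hypothetical stand-ins, not Tika or Mime4j classes, and the default of 10000 is only an assumption borrowed from the limit mentioned for TIKA-461 in this thread.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextLookupSketch {

    /** Hypothetical configuration object with a conservative default. */
    static final class LineLimitConfig {
        private int maxLineLen = 10000; // assumed default, see lead-in

        int getMaxLineLen() {
            return maxLineLen;
        }

        void setMaxLineLen(int maxLineLen) {
            this.maxLineLen = maxLineLen;
        }
    }

    /** Hypothetical context: at most one object per class key. */
    static final class SimpleContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value != null ? key.cast(value) : defaultValue;
        }
    }

    /** What a parser might do: use the caller's config if present, else the default. */
    static int effectiveMaxLineLen(SimpleContext context) {
        LineLimitConfig config =
                context.get(LineLimitConfig.class, new LineLimitConfig());
        return config.getMaxLineLen();
    }

    public static void main(String[] args) {
        SimpleContext context = new SimpleContext();
        System.out.println("default limit: " + effectiveMaxLineLen(context));

        // Opt in to "no limit" for this parse only.
        LineLimitConfig config = new LineLimitConfig();
        config.setMaxLineLen(-1);
        context.set(LineLimitConfig.class, config);
        System.out.println("custom limit: " + effectiveMaxLineLen(context));
    }
}
```

The appeal of this design is that the default stays conservative for everyone, while a caller who accepts the memory risk can lift the limit without any change to the parser's constructor.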
