Posted to dev@tika.apache.org by "kumar raja jana (JIRA)" <ji...@apache.org> on 2009/02/05 10:13:59 UTC

[jira] Created: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

Microsoft Outlook (msg) files get parsed multiple times
-------------------------------------------------------

                 Key: TIKA-197
                 URL: https://issues.apache.org/jira/browse/TIKA-197
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.3
            Reporter: kumar raja jana
             Fix For: 0.3


Microsoft Outlook (.msg) files get parsed around 50 times when opened in TikaGUI.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kumar raja jana updated TIKA-197:
---------------------------------

    Attachment: MIME.msg

sample document for testing



[jira] Issue Comment Edited: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671837#action_12671837 ] 

kumarraja.j edited comment on TIKA-197 at 2/9/09 4:58 AM:
--------------------------------------------------------------

Thanks Jukka for fixing this :) (sorry)

      was (Author: kumarraja.j):
    Thanks Jutta for fixing this :)
  


[jira] Resolved: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-197.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks for reporting this!

This issue was caused by the OfficeParser class using a special pattern for detecting Outlook-specific entries inside Microsoft's OLE2 container format. Outlook-specific parsing was triggered whenever an internal entry matching the pattern was detected. Our previous test .msg file only contained one such entry so we never saw this issue, but apparently it's possible and even likely for Outlook files to contain multiple such entries.

I fixed the issue in revision 742187 simply by introducing a special marker flag that prevents the Outlook extractor from being fired more than once per document being parsed. It's a bit ugly, but it works. :-)
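
As a rough illustration of the problem and the marker-flag approach described above, here is a minimal, self-contained sketch. It is not the code from revision 742187: the class name, the method names, and the entry-name pattern are all assumptions made for this example, and the "extraction" is reduced to a counter so the effect of the flag is easy to see.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from the Tika source.
public class OutlookGuardSketch {

    // Assumed pattern for Outlook-specific entry names inside the OLE2
    // container. The pattern actually used by OfficeParser is not quoted
    // in this thread.
    private static final Pattern OUTLOOK_ENTRY =
            Pattern.compile("__substg1\\.0_.*");

    // Behaviour as described for the bug: every matching entry
    // triggers another Outlook extraction of the same document.
    static int extractionsWithoutGuard(List<String> entryNames) {
        int extractions = 0;
        for (String name : entryNames) {
            if (OUTLOOK_ENTRY.matcher(name).matches()) {
                extractions++; // stands in for a full Outlook extraction
            }
        }
        return extractions;
    }

    // Behaviour as described for the fix: a marker flag lets the
    // extraction fire at most once per document being parsed.
    static int extractionsWithGuard(List<String> entryNames) {
        boolean outlookExtracted = false; // the marker flag
        int extractions = 0;
        for (String name : entryNames) {
            if (!outlookExtracted && OUTLOOK_ENTRY.matcher(name).matches()) {
                outlookExtracted = true;
                extractions++; // stands in for a full Outlook extraction
            }
        }
        return extractions;
    }

    public static void main(String[] args) {
        // Illustrative entry names; a real .msg file may contain many more.
        List<String> entries = Arrays.asList(
                "__properties_version1.0",
                "__substg1.0_0037001E",
                "__substg1.0_1000001E");
        System.out.println("without guard: " + extractionsWithoutGuard(entries));
        System.out.println("with guard:    " + extractionsWithGuard(entries));
    }
}
```

With a file that has around 50 matching entries, the unguarded loop would run the extractor around 50 times, which matches the symptom reported in TikaGUI; the guarded loop runs it once regardless of how many entries match.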



[jira] Commented: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671837#action_12671837 ] 

kumar raja jana commented on TIKA-197:
--------------------------------------

Thanks Jutta for fixing this :)
