Posted to dev@tika.apache.org by "kumar raja jana (JIRA)" <ji...@apache.org> on 2009/02/05 10:13:59 UTC
[jira] Created: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Microsoft Outlook (msg) files get parsed multiple times
-------------------------------------------------------
Key: TIKA-197
URL: https://issues.apache.org/jira/browse/TIKA-197
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.3
Reporter: kumar raja jana
Fix For: 0.3
Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
kumar raja jana updated TIKA-197:
---------------------------------
Attachment: MIME.msg
sample document for testing
> Microsoft Outlook (msg) files get parsed multiple times
> -------------------------------------------------------
>
> Key: TIKA-197
> URL: https://issues.apache.org/jira/browse/TIKA-197
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.3
> Reporter: kumar raja jana
> Fix For: 0.3
>
> Attachments: MIME.msg
>
>
> Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI
[jira] Issue Comment Edited: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671837#action_12671837 ]
kumarraja.j edited comment on TIKA-197 at 2/9/09 4:58 AM:
--------------------------------------------------------------
Thanks Jukka for fixing this :) (sorry)
was (Author: kumarraja.j):
Thanks Jutta for fixing this :)
> Microsoft Outlook (msg) files get parsed multiple times
> -------------------------------------------------------
>
> Key: TIKA-197
> URL: https://issues.apache.org/jira/browse/TIKA-197
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.3
> Reporter: kumar raja jana
> Assignee: Jukka Zitting
> Fix For: 0.3
>
> Attachments: MIME.msg
>
>
> Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI
[jira] Resolved: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-197.
--------------------------------
Resolution: Fixed
Assignee: Jukka Zitting
Thanks for reporting this!
This issue was caused by the OfficeParser class using a special pattern to detect Outlook-specific entries inside Microsoft's OLE2 container format. Outlook-specific parsing was triggered every time an internal entry matching the pattern was detected. Our previous test .msg file contained only one such entry, so we never saw this issue, but apparently it's possible and even likely for Outlook files to contain multiple such entries.
I fixed the issue in revision 742187 simply by introducing a special marker flag that prevents the Outlook extractor from being fired more than once per document being parsed. It's a bit ugly, but it works. :-)
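For readers who want to see the shape of the problem, here is a minimal, self-contained sketch of the "fire once per document" flag described above. It is not the code from revision 742187: the class name OutlookOnceSketch, the method names, the entry names, and the pattern are all assumptions made up for illustration, and the "extractor" is reduced to a counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names come from the Tika source.
public class OutlookOnceSketch {

    // Assumed pattern for Outlook-specific entry names inside the container.
    private static final Pattern OUTLOOK_ENTRY = Pattern.compile("__substg1\\.0_.*");

    // Buggy shape: the Outlook extractor runs once for every matching entry.
    public static int extractWithoutFlag(List<String> entryNames) {
        int extractorRuns = 0;
        for (String name : entryNames) {
            if (OUTLOOK_ENTRY.matcher(name).matches()) {
                extractorRuns++; // stands in for running the Outlook extractor
            }
        }
        return extractorRuns;
    }

    // Fixed shape: a marker flag lets the extractor run at most once per document.
    public static int extractWithFlag(List<String> entryNames) {
        boolean outlookExtracted = false; // reset for each document being parsed
        int extractorRuns = 0;
        for (String name : entryNames) {
            if (!outlookExtracted && OUTLOOK_ENTRY.matcher(name).matches()) {
                outlookExtracted = true;
                extractorRuns++; // stands in for running the Outlook extractor
            }
        }
        return extractorRuns;
    }

    public static void main(String[] args) {
        // Hypothetical entry names for a single .msg document.
        List<String> entries = Arrays.asList(
                "__substg1.0_0037001E",
                "__substg1.0_1000001E",
                "__properties_version1.0");
        System.out.println("without flag: " + extractWithoutFlag(entries));
        System.out.println("with flag: " + extractWithFlag(entries));
    }
}
```

With the three sample entries, the unguarded loop runs the extractor twice and the guarded loop runs it once. The flag has to be scoped to a single document; if it lived in a field shared across parse calls, later documents would never be extracted.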
[jira] Commented: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times
Posted by "kumar raja jana (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671837#action_12671837 ]
kumar raja jana commented on TIKA-197:
--------------------------------------
Thanks Jutta for fixing this :)