Posted to dev@tika.apache.org by "Jana, Kumar Raja" <kj...@ptc.com> on 2009/02/04 12:00:42 UTC

Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Hi,

I was feeding various document formats to the TikaGUI tool and found
that Microsoft Outlook (msg) files get parsed around 50 times!!!

Did anyone else face the same issue? Is there any setting that I might
have overlooked?

Thanks,
Kumar


RE: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Posted by "Jana, Kumar Raja" <kj...@ptc.com>.
Sure...will send the bug details soon

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Thursday, February 05, 2009 2:10 PM
To: tika-dev@lucene.apache.org
Subject: Re: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Hi,

On Thu, Feb 5, 2009 at 7:10 AM, Jana, Kumar Raja <kj...@ptc.com> wrote:
> I see 50 copies of the content in the extracted text output.

OK. This is probably some issue with the Outlook parser from POI or
with the way we use it in Tika.

> I have attached a sample Outlook (msg) file to this mail (which happens
> to be a mail from you to the dev group). Hope it helps.

Unfortunately the mailing list filters seem to have stripped the
attachment. Can you file a bug report about this in Jira and attach
the example mail there?

BR,

Jukka Zitting

Re: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Thu, Feb 5, 2009 at 7:10 AM, Jana, Kumar Raja <kj...@ptc.com> wrote:
> I see 50 copies of the content in the extracted text output.

OK. This is probably some issue with the Outlook parser from POI or
with the way we use it in Tika.

> I have attached a sample Outlook (msg) file to this mail (which happens
> to be a mail from you to the dev group). Hope it helps.

Unfortunately the mailing list filters seem to have stripped the
attachment. Can you file a bug report about this in Jira and attach
the example mail there?

BR,

Jukka Zitting

RE: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Posted by "Jana, Kumar Raja" <kj...@ptc.com>.
Hi Jukka,

Thanks for the quick reply.
I see 50 copies of the content in the extracted text output. I have
attached a sample Outlook (msg) file to this mail (which happens to be a
mail from you to the dev group). Hope it helps.

Thanks again,
Kumar

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Thursday, February 05, 2009 5:22 AM
To: tika-dev@lucene.apache.org
Subject: Re: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Hi,

On Wed, Feb 4, 2009 at 12:00 PM, Jana, Kumar Raja <kj...@ptc.com> wrote:
> I was feeding various document formats to the TikaGUI tool and found
> that Microsoft Outlook (msg) files get parsed around 50 times!!!

Hmm, that's quite a lot... How does this "50 times" appear? Do you get
50 copies of the message content in the extracted text output? Do you
have an example file that you could share with us?

BR,

Jukka Zitting

Re: Microsoft Outlook (msg) files get parsed 50 times in TikaGUI

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Feb 4, 2009 at 12:00 PM, Jana, Kumar Raja <kj...@ptc.com> wrote:
> I was feeding various document formats to the TikaGUI tool and found
> that Microsoft Outlook (msg) files get parsed around 50 times!!!

Hmm, that's quite a lot... How does this "50 times" appear? Do you get
50 copies of the message content in the extracted text output? Do you
have an example file that you could share with us?

BR,

Jukka Zitting