Posted to java-user@lucene.apache.org by Grant Ingersoll <gs...@apache.org> on 2007/07/02 02:30:41 UTC

Exchange/PST/Mail parsing

Does anyone have recommendations for a decent, open extractor for
MS Exchange and/or PST files?  It doesn't have to be Apache-licensed,
but I would prefer non-GPL if possible.  The Zoe link on the FAQ [1]
seems dead.

For mbox, I think mstor will suffice for me, and I think tropo (from
the FAQ) should work for IMAP.  Does anyone have experience with
http://aperture.sourceforge.net/ ?

[1] http://wiki.apache.org/lucene-java/LuceneFAQ#head-bcba2effabe224d5fb8c1761e4da1fedceb9800e

Cheers,
Grant



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Exchange/PST/Mail parsing

Posted by jm <jm...@gmail.com>.
We had to develop VB code to convert PST files to EML files.

I am using mbox, which works fine for me. I am also using Aperture, but
only for extracting text from non-mail files (Office documents etc.);
that works fine too.
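
For readers who want a feel for what mbox handling involves, here is a
minimal sketch in plain Java. It assumes the common mbox convention that
every message starts with a separator line beginning with "From ". The
class name MboxSplitter is made up for illustration, the code does not
undo ">From " escaping inside bodies, and a dedicated library such as
mstor should cover the format far more completely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of mstor or Aperture.
class MboxSplitter {

    /**
     * Splits mbox text into raw messages (headers plus body).
     * The "From " separator lines themselves are dropped, and any
     * text before the first separator is ignored.
     */
    static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                // A new separator closes the message collected so far.
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String sample =
            "From a@example.org Mon Jul  2 02:30:41 2007\n"
            + "Subject: one\n\nbody one\n\n"
            + "From b@example.org Mon Jul  2 03:00:00 2007\n"
            + "Subject: two\n\nbody two\n";
        for (String message : split(sample)) {
            System.out.println("--- message ---");
            System.out.print(message);
        }
    }
}
```

Each returned string could then be handed to a mail parser or indexed
directly, depending on how much header detail is needed.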


Re: Exchange/PST/Mail parsing

Posted by Christiaan Fluit <ch...@aduna-software.com>.
Hello Grant (cc-ing aperture-devel),

I am one of the Aperture admins, so I can tell you a bit more about
Aperture's mail facilities.

Short intro: Aperture is a framework for crawling a growing number of
sources and file formats and extracting their full text and metadata. We
try to select the best of breed among the many open source libraries
that tackle a specific source or format (e.g. PDFBox, POI, JavaMail) and
write some glue code around them so that they can be invoked in a
uniform way. It's currently used in a number of desktop and enterprise
search applications, both research and production systems.
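
To illustrate the "glue code" idea in general terms, here is a minimal
sketch of a uniform extraction interface with a registry keyed by MIME
type. All names (TextExtractor, ExtractorRegistry) are hypothetical and
are not Aperture's actual API; in a real setup each registered extractor
would wrap a library such as PDFBox or POI.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration; not Aperture's real API.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byMimeType = new HashMap<>();

    /** Associates an extractor with a MIME type such as "text/plain". */
    void register(String mimeType, TextExtractor extractor) {
        byMimeType.put(mimeType, extractor);
    }

    /** Looks up the extractor for the MIME type and runs it on the data. */
    String extract(String mimeType, byte[] data) {
        TextExtractor extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for " + mimeType);
        }
        return extractor.extract(data);
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Plain text needs no library; other formats would delegate to one.
        registry.register("text/plain", data -> new String(data, StandardCharsets.UTF_8));
        byte[] sample = "hello from a text file".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("text/plain", sample));
    }
}

// Hypothetical uniform interface: every format-specific wrapper implements it.
interface TextExtractor {
    String extract(byte[] data);
}
```

The caller only ever deals with the registry, so adding support for a
new format means registering one more wrapper.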

At the moment we support a number of mail systems.

We can crawl IMAP mailboxes through JavaMail. In general it seems to
work well; problems are usually caused by IMAP servers not conforming to
the IMAP specs.

Some people have used the ImapCrawler to crawl MS Exchange as well. Some 
succeeded, some didn't. I don't really know whether the fault is in 
Aperture's code or in the Exchange configuration but I would be happy to 
take a look at it when someone runs into problems.

Outlook can also be crawled by connecting to a running Outlook process
through jacob.dll. Others on aperture-devel can tell you more about its
current status. Besides this crawler, I would also be very interested in
having a crawler that directly processes .pst files, so as to steer
clear of communicating with processes outside your own control.

People have been working on crawling Thunderbird mailboxes but I don't 
know what the current status is.

Ultimately, we try to support any major mail system. In practice, effort 
is usually dependent on knowledge and experience as well as customer demand.

We are happy to help you out with trying to get Aperture working in your 
domain and looking into the problems that you may encounter.


Kind regards,

Chris



Re: Exchange/PST/Mail parsing

Posted by Nick Burch <ni...@torchbox.com>.
On Sun, 1 Jul 2007, Grant Ingersoll wrote:
> Anyone have any recommendations on a decent, open (doesn't have to be 
> Apache license, but would prefer non-GPL if possible), extractor for MS 
> Exchange and/or PST files?

There has been an offer to contribute a PST parser to Apache POI. We're 
hoping that Travis will have something to go into the POI scratchpad quite 
soon, but we understand he's currently still working on the first version.

I can only suggest you keep an eye on the POI dev list for when the code
comes through.

Nick
