
Posted to solr-user@lucene.apache.org by Johnny X <jo...@gmail.com> on 2008/11/08 01:53:40 UTC

Large Corpus XML Conversion?

I've been asked to look at the Enron e-mail corpus
(http://www.cs.cmu.edu/~enron/) and I've decided to use Solr as a means to
analyse it. 

So I have a few questions...

First off, how can I convert the flat file text below:


Message-ID: <18...@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast




into XML that I can feed into Solr?

Secondly, I'm looking into searching for particular things in the e-mails
and sorting them into groups as a result. For instance, characteristics of
an e-mail that suggest it concerns confidential company information.

How easy is it to make custom searches (based on semantics, word distances,
etc.) and use the results as an output?


I'm a complete newbie so any help is appreciated! I hope I've come to the
right place.

Thanks. :-)

Re: Large Corpus XML Conversion?

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Hi,

For message parsing you'll either have to write a custom parser or see if you can use JavaMail for that (or some other library if you are not working with Java).
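
As a rough illustration of the "some other library" route, here is a minimal sketch in Python using the standard library's email module. It is not a tested converter for the whole corpus. The Solr field names (id, from, to, subject, date, body) and the helper names are assumptions made for the example; they would have to match whatever your schema.xml defines. It also assumes each file holds a single plain-text (non-multipart) message, as in the sample above.

```python
import email
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical mapping from mail headers to Solr field names; the real
# names must match the fields declared in your schema.xml.
HEADER_FIELDS = {
    "Message-ID": "id",
    "From": "from",
    "To": "to",
    "Subject": "subject",
}


def message_to_doc(raw_text):
    """Parse one raw e-mail and return a Solr <doc> element."""
    msg = email.message_from_string(raw_text)
    doc = ET.Element("doc")

    def add_field(name, value):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value

    # Copy the mapped headers, skipping empty ones such as "Subject: ".
    for header, field_name in HEADER_FIELDS.items():
        value = (msg[header] or "").strip()
        if value:
            add_field(field_name, value)

    # Convert the RFC 2822 date to the UTC ISO 8601 form used by Solr dates.
    if msg["Date"]:
        sent = parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc)
        add_field("date", sent.strftime("%Y-%m-%dT%H:%M:%SZ"))

    # Assumes a non-multipart message, so the payload is a plain string.
    add_field("body", msg.get_payload().strip())
    return doc


def messages_to_add_xml(raw_messages):
    """Wrap the parsed messages in a Solr <add> update document."""
    add = ET.Element("add")
    for raw in raw_messages:
        add.append(message_to_doc(raw))
    return ET.tostring(add, encoding="unicode")


# Shortened copy of the sample message from the question.
SAMPLE = "\n".join([
    "Message-ID: <18...@thyme>",
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "From: phillip.allen@enron.com",
    "To: tim.belden@enron.com",
    "Subject: ",
    "X-Origin: Allen-P",
    "",
    "Here is our forecast",
])

print(messages_to_add_xml([SAMPLE]))
```

ElementTree takes care of XML escaping, which matters here because values such as the Message-ID contain angle brackets. For the full corpus you would probably write the documents out in batches rather than build one large string.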

As for the second part, that's not directly related to Solr. Extracting meaning out of text is something your application needs to do. Once it has done that, it can index the results with Solr so they can be searched later on.
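
To make the "application does the analysis, Solr does the search" split concrete, here is a small Python sketch of an application-side labelling step based on word distance. The term pairs, the label names, and the function names are all invented for illustration; real criteria for "confidential" would need far more thought than this.

```python
import re

# Hypothetical term pairs that, when close together, hint at confidentiality.
CONFIDENTIAL_PAIRS = [
    ("confidential", "information"),
    ("internal", "only"),
]


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_distance(tokens, first, second, max_gap):
    """True if `first` and `second` occur at most `max_gap` positions apart."""
    first_pos = [i for i, tok in enumerate(tokens) if tok == first]
    second_pos = [i for i, tok in enumerate(tokens) if tok == second]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)


def label_message(body, max_gap=5):
    """Return a coarse label for one e-mail body."""
    tokens = tokenize(body)
    for first, second in CONFIDENTIAL_PAIRS:
        if within_distance(tokens, first, second, max_gap):
            return "possibly_confidential"
    return "unlabelled"


print(label_message("This information is strictly confidential."))
print(label_message("Here is our forecast"))
```

The label returned for each message could then be stored as one more field on the document when it is indexed, so the groups can be searched or filtered later.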


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


