Posted to solr-user@lucene.apache.org by Abhishek Srivastava <ab...@gmail.com> on 2009/12/15 02:52:22 UTC

Searching .msg files

Hello Everyone,

In my company, we store a lot of old emails (.msg files) in a database (done
for the purpose of legal compliance).

The users have been asking us to give search functionality on the old
emails.

One of the primary requirements is that when people search, they should only
be able to search in their own emails (emails in which they were in the To,
Cc or Bcc list).

How can Solr be used for this?

From what I know about this product, it only searches XML content, so I
will have to extract the body of each email and convert it to XML, right?

How will I limit the search results to only those emails where the user who
is searching was in the To, Cc or Bcc list?

Please recommend an approach for meeting this requirement.

Re: Searching .msg files

Posted by Lance Norskog <go...@gmail.com>.
As to the indexing part:

This is an automated document input tool:
http://wiki.apache.org/solr/DataImportHandler

This is a plugin for it that pulls mail from an IMAP server:
http://wiki.apache.org/solr/MailEntityProcessor

This is a note about Microsoft .msg files and parsing them in Java:
http://www.rgagnon.com/javadetails/java-0613.html

It's quite possible that the ExtractingRequestHandler handles .msg
files. If it does not, I would take the code snippet from that
article and wrap it in a SolrJ program. Or, easier, do this in Ruby.
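
If the ExtractingRequestHandler route works, the request could be built
roughly as below. This is only a sketch under stated assumptions: the host,
the /update/extract path mapping, and the field names "id" and "recipients"
are placeholders for whatever the real solrconfig.xml and schema define, and
whether .msg files are actually parsed depends on the Tika version bundled
with Solr. The code only builds the URL; the .msg file itself would be sent
as the body of an HTTP POST to it.

```python
from urllib.parse import urlencode

# Assumed Solr location and handler path; adjust to the real deployment.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def build_extract_url(doc_id, recipients, commit=False):
    """Return the URL for posting one .msg file to the extracting handler.

    doc_id and recipients are passed as literal.* parameters so they are
    indexed alongside whatever text the handler extracts from the file.
    The field names "id" and "recipients" are hypothetical schema fields.
    """
    params = [("literal.id", doc_id)]
    # One literal.recipients parameter per address
    # (assumes "recipients" is declared multivalued in the schema).
    params += [("literal.recipients", addr.lower()) for addr in recipients]
    if commit:
        params.append(("commit", "true"))
    return SOLR_EXTRACT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_extract_url("msg-0001", ["Alice@example.com"], commit=True))
```

Lower-casing the addresses at index time keeps the later per-user filter
from missing a message because of a difference in case.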

On Mon, Dec 14, 2009 at 7:02 PM, Kay Kay <ka...@gmail.com> wrote:
> I remember seeing a similar thread on the Lucene user mailing list. You can
> check its archives.
>
> As for strategies, there are two options:
>
> * You can create one index per user, store the emails involving that user
> in it, and search only that index.
> (or)
>
> * You can have one large index with the To/Cc names as fields, and run
> every search by a given user through an initial filter pass on those
> fields.
>
> Solr can, of course, index a variety of content (see the Tika project) and
> is not restricted to XML at all.
>
> You would need to weigh the pros and cons of each depending on the size of
> your corpus and the usage and performance expectations of the search.
> Once you have settled on a strategy, you can define the Solr schema fields
> accordingly.



-- 
Lance Norskog
goksron@gmail.com

Re: Searching .msg files

Posted by Kay Kay <ka...@gmail.com>.
I remember seeing a similar thread on the Lucene user mailing list. You
can check its archives.

As for strategies, there are two options:

* You can create one index per user, store the emails involving that
user in it, and search only that index.
(or)

* You can have one large index with the To/Cc names as fields, and run
every search by a given user through an initial filter pass on those
fields.

Solr can, of course, index a variety of content (see the Tika project)
and is not restricted to XML at all.

You would need to weigh the pros and cons of each depending on the size
of your corpus and the usage and performance expectations of the
search.
Once you have settled on a strategy, you can define the Solr schema
fields accordingly.
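
For the single-index option, the "initial filter pass" maps naturally onto
Solr's fq (filter query) parameter. The sketch below shows one way the
request parameters might be assembled. The field name "recipients" and the
base URL are assumptions, not something from this thread, and the user's
address should come from the authenticated session, never from the search
form itself.

```python
from urllib.parse import urlencode


def build_search_params(user_query, user_address, rows=10):
    """Build Solr select parameters restricted to one user's mail.

    "recipients" is a hypothetical multivalued field holding the To, Cc
    and Bcc addresses of each indexed message.
    """
    # Escape backslashes and double quotes so the address stays inside
    # the quoted phrase of the filter query.
    escaped = user_address.lower().replace("\\", "\\\\").replace('"', '\\"')
    return {
        "q": user_query,                     # what the user typed
        "fq": 'recipients:"%s"' % escaped,   # restriction added by the server
        "rows": str(rows),
    }


def build_search_url(base_url, user_query, user_address):
    """Return the full select URL for the given Solr base URL (assumed)."""
    params = build_search_params(user_query, user_address)
    return base_url + "/select?" + urlencode(params)


if __name__ == "__main__":
    print(build_search_url("http://localhost:8983/solr", "budget", "bob@example.com"))
```

Because the restriction is a separate fq rather than part of q, it does not
affect relevance scoring, and Solr can cache the filter per user.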






Re: Searching .msg files

Posted by javaxmlsoapdev <vi...@yahoo.com>.

1) Use Tika to index the .msg files (Tika supports the Microsoft Outlook
format, and I am already using it:
http://lucene.apache.org/tika/formats.html).
2) While indexing, you will have to write a handler that extracts the To,
Cc and Bcc values and stores them in a separate field in the index.
3) When a user searches the .msg files, check whether s/he is in the To,
Cc or Bcc field before returning results to the page, and filter the
results accordingly.
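
Step 2 might look something like the sketch below. It assumes the To, Cc and
Bcc values have already been pulled out of the .msg file (for example by
Tika) as comma-separated header strings; Outlook sometimes uses semicolons
instead, which this does not handle. The field names "id", "body" and
"recipients" are made up for illustration.

```python
from email.utils import getaddresses


def recipient_addresses(to="", cc="", bcc=""):
    """Return the sorted, de-duplicated, lower-cased recipient addresses."""
    # Skip empty headers so only real header values reach the parser.
    headers = [h for h in (to, cc, bcc) if h and h.strip()]
    pairs = getaddresses(headers)
    # Each pair is (display name, address); keep only non-empty addresses.
    return sorted({addr.lower() for _name, addr in pairs if addr})


def build_document(doc_id, body, to="", cc="", bcc=""):
    """Build one index document with a single combined recipients field."""
    return {
        "id": doc_id,
        "body": body,
        "recipients": recipient_addresses(to, cc, bcc),
    }


if __name__ == "__main__":
    doc = build_document(
        "msg-1",
        "hello",
        to="Alice <alice@example.com>, Bob <BOB@example.com>",
        bcc="carol@example.com",
    )
    print(doc["recipients"])
```

Merging To, Cc and Bcc into one field keeps the check in step 3 to a single
field lookup. If the distinction between them matters for display, separate
fields could be stored as well.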



