Posted to solr-user@lucene.apache.org by Abhishek Srivastava <ab...@gmail.com> on 2009/12/15 02:52:22 UTC
Searching .msg files
Hello Everyone,
In my company, we store a lot of old emails (.msg files) in a database (done
for the purpose of legal compliance).
The users have been asking us to provide search functionality for the old
emails.
One of the primary requirements is that when people search, they should only
be able to search their own emails (emails in which they were on the To,
Cc or Bcc list).
How can Solr be used for this?
From what I know about this product, it only searches XML content, so I
will have to extract the body of each email and convert it to XML, right?
How can I limit the search results to only those emails where the user who
is searching was on the To, Cc or Bcc list?
Please recommend an approach for meeting this requirement.
Re: Searching .msg files
Posted by Lance Norskog <go...@gmail.com>.
As to the indexing part:
This is an automated document input tool:
http://wiki.apache.org/solr/DataImportHandler
This is a plugin for it that pulls mail from an IMAP server:
http://wiki.apache.org/solr/MailEntityProcessor
This is a note about parsing Microsoft MSG files in Java:
http://www.rgagnon.com/javadetails/java-0613.html
It's quite possible that the ExtractingRequestHandler handles .msg
files. If it does not, I would take the code snippet from that
article and wrap it in a SolrJ program. Or, easier, do this in Ruby.
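If the ExtractingRequestHandler route does work for .msg files, the request
could look roughly like the sketch below. It only builds the URL and sends
nothing. The base URL, the /update/extract path (the handler's usual mapping
in solrconfig.xml) and the field names "id" and "recipients" are assumptions
for illustration, not something confirmed in this thread.

```python
from urllib.parse import urlencode

def extract_url(base_url, doc_id, recipients, commit=True):
    """Build an ExtractingRequestHandler URL for one file.

    literal.<field> parameters attach fixed field values to the
    document that Solr creates from the extracted content.
    """
    params = [("literal.id", doc_id)]
    # One literal.recipients parameter per address (assumes a multi-valued field).
    params += [("literal.recipients", r.lower()) for r in recipients]
    if commit:
        params.append(("commit", "true"))
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)

# Hypothetical host, document id and addresses.
url = extract_url("http://localhost:8983/solr", "msg-0001",
                  ["Alice@example.com", "bob@example.com"])
print(url)
```

The .msg file itself would then be sent as the body of a POST to that URL,
for example with an HTTP client library.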
On Mon, Dec 14, 2009 at 7:02 PM, Kay Kay <ka...@gmail.com> wrote:
> I remember seeing a similar thread on the Lucene user mailing list. You can
> check its archives.
>
> As for strategies, there are two options:
>
> * You can create an index per user, store the email content involving that
> user in it, and use it for search.
> (or)
>
> * You can have one gigantic index with the To/Cc names as fields, and
> all searches by a given user would go through an initial filter pass on
> this index.
>
> Solr can, of course, index a variety of content (see the Tika project) and
> is not restricted to XML at all.
>
> You would need to weigh the pros and cons of each depending on the
> corpus of data you are talking about and the usage / performance
> expectations of the search.
> Once you have settled on a strategy, you can define the Solr schema for
> the fields and use it.
--
Lance Norskog
goksron@gmail.com
Re: Searching .msg files
Posted by Kay Kay <ka...@gmail.com>.
I remember seeing a similar thread on the Lucene user mailing list. You
can check its archives.
As for strategies, there are two options:
* You can create an index per user, store the email content involving
that user in it, and use it for search.
(or)
* You can have one gigantic index with the To/Cc names as fields, and
all searches by a given user would go through an initial filter pass
on this index.
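For the single-index option, the filter pass could be expressed as a Solr
filter query (the fq parameter). The sketch below only builds the request
parameters. The field name "recipients" is a hypothetical multi-valued
field holding the To/Cc/Bcc addresses, and the /select path is assumed to
be the standard search handler.

```python
from urllib.parse import urlencode

def user_search_params(user_email, text):
    """Query parameters for a search restricted to one user's mail."""
    # Escape backslashes and double quotes so the address stays inside the phrase.
    safe = user_email.lower().replace("\\", "\\\\").replace('"', '\\"')
    return {
        "q": text,                     # what the user typed
        "fq": f'recipients:"{safe}"',  # filter: user must be a recipient
        "wt": "json",
    }

# Hypothetical user and search text.
params = user_search_params("Alice@example.com", "quarterly report")
print("/select?" + urlencode(params))
```

The fq value should be set on the server side from the authenticated user,
never taken from anything the browser sends, otherwise a user could simply
ask for someone else's mail.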
Solr can, of course, index a variety of content (see the Tika project) and
is not restricted to XML at all.
You would need to weigh the pros and cons of each depending on the
corpus of data you are talking about and the usage / performance
expectations of the search.
Once you have settled on a strategy, you can define the Solr schema for
the fields and use it.
Re: Searching .msg files
Posted by javaxmlsoapdev <vi...@yahoo.com>.
1) Use Tika to index the .msg files (Tika does support the Microsoft Outlook
format, and I am already using it: http://lucene.apache.org/tika/formats.html).
2) While indexing, you'll have to write a handler to extract the To, Cc and
Bcc values and store them in a separate field in the index.
3) When a user searches the .msg files, check whether s/he is in the To, Cc
or Bcc field before returning results to the page, and filter accordingly.
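Steps 2 and 3 can be sketched as follows. This is only an illustration: the
metadata is a plain dict standing in for whatever the extraction step
returns, the key names "Message-To", "Message-Cc" and "Message-Bcc" are
assumptions about how the recipients are exposed, and "recipients" is a
hypothetical name for the separate index field.

```python
def recipients_from_metadata(metadata):
    """Collect To, Cc and Bcc addresses into one lowercase, de-duplicated list."""
    found = []
    for key in ("Message-To", "Message-Cc", "Message-Bcc"):  # assumed key names
        value = metadata.get(key, [])
        if isinstance(value, str):  # a single address may arrive as a bare string
            value = [value]
        for addr in value:
            addr = addr.strip().lower()
            if addr and addr not in found:
                found.append(addr)
    return found

def visible_to(user_email, doc):
    """Step 3: true only if the user is among the stored recipients."""
    return user_email.lower() in doc.get("recipients", [])

# Hypothetical extracted metadata for one message.
meta = {"Message-To": ["Alice@example.com"], "Message-Cc": "bob@example.com"}
doc = {"id": "msg-0001", "recipients": recipients_from_metadata(meta)}

# Keep only the hits the searching user is allowed to see.
hits = [d for d in [doc] if visible_to("alice@example.com", d)]
print(hits)
```

Normalizing the addresses to lowercase at index time keeps the comparison
in step 3 simple. Filtering after the search, as in step 3, means a page of
results can shrink once the hidden hits are dropped, so the paging logic
has to allow for that.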