Posted to solr-dev@lucene.apache.org by jm <jm...@gmail.com> on 2009/04/03 17:30:29 UTC

Re: [jira] Updated: (SOLR-934) Enable importing of mails into a solr index through DIH.

I don't know if I am missing something, but emails have a Message-ID
header that is unique by definition. Would that do?

On Fri, Apr 3, 2009 at 1:12 PM, Shalin Shekhar Mangar (JIRA)
<ji...@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Shalin Shekhar Mangar updated SOLR-934:
> ---------------------------------------
>
>    Attachment: SOLR-934.patch
>
> Changes:
> # Parse and store the fetchMailsSince string during init.
> # Return the sentDate as a Date object rather than as a long timestamp
> # Removed context as an argument from the getXFromContext methods
> # Removed unused getLongFromContext method
>
> I just indexed a month's worth of my gmail inbox. Works great!
>
> One question: what uniqueKey should we use when indexing emails? I couldn't figure it out, so I removed the uniqueKey from my schema to try this out.
>
> Next steps:
> # Enhance the ant build file to copy the dependencies to example/solr/lib just like Solr Cell does.
> # Add a wiki page with setup instructions, a list of dependencies, an example schema and data-config.xml
>
>> Enable importing of mails into a solr index through DIH.
>> --------------------------------------------------------
>>
>>                 Key: SOLR-934
>>                 URL: https://issues.apache.org/jira/browse/SOLR-934
>>             Project: Solr
>>          Issue Type: New Feature
>>          Components: contrib - DataImportHandler
>>    Affects Versions: 1.4
>>            Reporter: Preetam Rao
>>            Assignee: Shalin Shekhar Mangar
>>             Fix For: 1.4
>>
>>         Attachments: SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, SOLR-934.patch
>>
>>   Original Estimate: 24h
>>  Remaining Estimate: 24h
>>
>> Enable importing of mails into Solr through DIH. Take one or more sets of mailbox credentials, then download and index the mail content along with the content of attachments. The folders to fetch can be configured based on various criteria. Apache Tika is used to extract content from different kinds of attachments. JavaMail is used for mailbox operations such as fetching and filtering mails.
>> The basic configuration for one mailbox is shown below:
>> {code:xml}
>> <document>
>>    <entity processor="MailEntityProcessor" user="somebody@gmail.com"
>>                 password="something" host="imap.gmail.com" protocol="imaps"/>
>> </document>
>> {code}
>> The full list of available configuration options is below:
>> {color:green}Required{color}
>> ---------
>> *user*
>> *password*
>> *protocol*  (only "imaps" supported now)
>> *host*
>> {color:green}Optional{color}
>> ---------
>> *folders* - comma-separated list of folders.
>> If not specified, the default folder is used. Nested folders can be specified like a/b/c
>> *recurse* - index subfolders. Defaults to true.
>> *exclude* - comma-separated list of patterns.
>> *include* - comma-separated list of patterns.
>> *batchSize* - mails to fetch at once in a given folder.
>> Only headers can be prefetched in JavaMail IMAP.
>> *readTimeout* - defaults to 60000ms
>> *connectTimeout* - defaults to 30000ms
>> *fetchSize* - IMAP config. 32KB default
>> *fetchMailsSince* -
>> date/time in "yyyy-MM-dd HH:mm:ss" format; only mails received after it are fetched. Useful for delta imports.
>> *customFilter* - class name.
>> {code}
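>> import javax.mail.Flags;
>> import javax.mail.Folder;
>> import javax.mail.search.FlagTerm;
>> import javax.mail.search.SearchTerm;
>> // Example only: MyMailFilter is a placeholder name; this filter matches unread mails.
>> public class MyMailFilter implements MailEntityProcessor.CustomFilter {
>>   public SearchTerm getCustomSearch(Folder folder) {
>>     return new FlagTerm(new Flags(Flags.Flag.SEEN), false);
>>   }
>> }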
>> {code}
>> *processAttachement* - defaults to true
>> The indexed fields are listed below.
>> {code}
>>   // Fields To Index
>>   // single valued
>>   private static final String SUBJECT = "subject";
>>   private static final String FROM = "from";
>>   private static final String SENT_DATE = "sentDate";
>>   private static final String XMAILER = "xMailer";
>>   // multi valued
>>   private static final String TO_CC_BCC = "allTo";
>>   private static final String FLAGS = "flags";
>>   private static final String CONTENT = "content";
>>   private static final String ATTACHMENT = "attachement";
>>   private static final String ATTACHMENT_NAMES = "attachementNames";
>>   // flag values
>>   private static final String FLAG_ANSWERED = "answered";
>>   private static final String FLAG_DELETED = "deleted";
>>   private static final String FLAG_DRAFT = "draft";
>>   private static final String FLAG_FLAGGED = "flagged";
>>   private static final String FLAG_RECENT = "recent";
>>   private static final String FLAG_SEEN = "seen";
>> {code}

Re: [jira] Updated: (SOLR-934) Enable importing of mails into a solr index through DIH.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Apr 3, 2009 at 9:00 PM, jm <jm...@gmail.com> wrote:

> I don't know if I am missing something, but emails have a Message-ID
> header that is unique by definition. Would that do?
>

Yes, but I guess the patch does not expose this field. We should also support
the in-reply-to attribute.
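
To make the idea concrete, here is a rough sketch of how the two headers
could be turned into fields. None of this is in the patch: the class name
and the field names "messageId" and "inReplyTo" are made up for
illustration, and the headers are assumed to be already available as a
plain map (in JavaMail they would come from Message.getHeader(String),
which returns a String[]). It only handles a single id per header, and a
mail without a Message-ID header simply gets no key, so a fallback would
still be needed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of the SOLR-934 patch.
final class MailKeySketch {

    // Hypothetical Solr field names; the patch defines neither of them.
    static final String MESSAGE_ID = "messageId";
    static final String IN_REPLY_TO = "inReplyTo";

    /** Trims whitespace and strips one pair of enclosing angle brackets. */
    static Optional<String> normalizeId(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String id = raw.trim();
        if (id.length() >= 2 && id.startsWith("<") && id.endsWith(">")) {
            id = id.substring(1, id.length() - 1).trim();
        }
        return id.isEmpty() ? Optional.empty() : Optional.of(id);
    }

    /** Looks a header up case-insensitively, since header names are case-insensitive. */
    static Optional<String> header(Map<String, String> headers, String name) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (name.equalsIgnoreCase(e.getKey())) {
                return normalizeId(e.getValue());
            }
        }
        return Optional.empty();
    }

    /** Builds the extra fields for one mail; a missing header is left out. */
    static Map<String, Object> keyFields(Map<String, String> headers) {
        Map<String, Object> row = new LinkedHashMap<>();
        header(headers, "Message-ID").ifPresent(v -> row.put(MESSAGE_ID, v));
        header(headers, "In-Reply-To").ifPresent(v -> row.put(IN_REPLY_TO, v));
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Message-Id", "<abc123@mail.example.com>");
        headers.put("In-Reply-To", " <parent42@mail.example.com> ");
        System.out.println(keyFields(headers));
    }
}
```

With something like this, the schema's uniqueKey could point at the
message id field, and the in-reply-to field could be used to group
threads at query time.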

-- 
Regards,
Shalin Shekhar Mangar.