Posted to dev@lucene.apache.org by "Timothy Potter (JIRA)" <ji...@apache.org> on 2014/07/01 00:46:27 UTC

[jira] [Updated] (SOLR-2245) MailEntityProcessor Update

     [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Timothy Potter updated SOLR-2245:
---------------------------------

    Attachment: SOLR-2245.patch

Here's an updated patch that's close to being ready for commit. I've changed a few things in the implementation, but I believe it still meets the spirit of Peter's original work. Mainly, this patch removes support for the delta-import command and supports only full-import, using the last_index_time from the previous run as the value for the fetchMailsSince filter.

The delta-import command is really meant for importing updates to existing rows, and the MailEntityProcessor was sort of hijacking that behavior. More to the point, I couldn't get the DocBuilder#collectDelta code to work with the rows generated by MailEntityProcessor#nextModifiedRowKey. Put simply, nextModifiedRowKey was returning new mails that arrived after the fetchMailsSince date filter, and the DocBuilder was processing them as if they were updates to pre-existing rows.

Thus, I felt it is better to support only full-import and have the code set the fetchMailsSince filter based on the last_index_time set by the DIH framework, which gets persisted in dataimport.properties. Of course, if that property is not set, the code falls back to fetchMailsSince from the config.
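
To illustrate the fallback described above, here is a minimal sketch. It is not the code from the attached patch. The class and method names are hypothetical, and it assumes both last_index_time and fetchMailsSince use the "yyyy-MM-dd HH:mm:ss" format.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

// Hypothetical helper, not part of the patch: picks the date used for the
// fetchMailsSince filter on a full-import.
class FetchSinceResolver {
  // Assumption: both values are written as "yyyy-MM-dd HH:mm:ss".
  static final String DATE_FORMAT = "yyyy-MM-dd HH:mm:ss";

  // persisted: contents of dataimport.properties (may lack last_index_time).
  // configuredValue: the fetchMailsSince attribute from the config, or null.
  // Returns null when neither source provides a value (no date filter).
  static Date resolveFetchMailsSince(Properties persisted, String configuredValue) {
    String lastIndexTime = persisted.getProperty("last_index_time");
    boolean hasLast = lastIndexTime != null && !lastIndexTime.trim().isEmpty();

    // Prefer the time of the previous run; otherwise fall back to the config.
    String chosen = hasLast ? lastIndexTime.trim() : configuredValue;
    if (chosen == null) {
      return null;
    }
    SimpleDateFormat fmt = new SimpleDateFormat(DATE_FORMAT);
    fmt.setLenient(false);
    try {
      return fmt.parse(chosen.trim());
    } catch (ParseException e) {
      throw new IllegalArgumentException("Unparseable date: " + chosen, e);
    }
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    // First run: nothing persisted yet, so the configured value is used.
    System.out.println(resolveFetchMailsSince(props, "2014-01-01 00:00:00"));
    // Later run: the persisted last_index_time takes precedence.
    props.setProperty("last_index_time", "2014-06-30 12:00:00");
    System.out.println(resolveFetchMailsSince(props, "2014-01-01 00:00:00"));
  }
}
```

With this shape, the first full-import honors the configured date, and each later run only fetches mail newer than the previous run.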

> MailEntityProcessor Update
> --------------------------
>
>                 Key: SOLR-2245
>                 URL: https://issues.apache.org/jira/browse/SOLR-2245
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4, 1.4.1
>            Reporter: Peter Sturge
>            Assignee: Timothy Potter
>            Priority: Minor
>             Fix For: 4.9, 5.0
>
>         Attachments: SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.zip
>
>
> This patch addresses a number of issues in the MailEntityProcessor contrib-extras module.
> The changes are outlined here:
> * Added an 'includeContent' entity attribute to allow specifying content to be included independently of processing attachments
>      e.g. <entity includeContent="true" processAttachments="false" . . . /> would include message content, but not attachment content
> * Added a property called 'processAttachments', which is a synonym for the misspelled (and singular) 'processAttachement' property and functions the same way. Default = 'true'; if either is false, attachments are not processed. Note that only one of these should be specified in a given <entity> tag.
> * Added a FLAGS.NONE value, so that if an email has no flags (i.e. it is unread, not deleted etc.), there is still a property value stored in the 'flags' field (the value is the string "none")
> Note: there is a potential backward-compatibility issue with FLAGS.NONE for clients that expect the absence of the 'flags' field to mean 'Not read'. I expect this to be extremely rare, and relying on it is inadvisable in any case, as user flags can be arbitrarily set, so fixing it now will ensure future client access is consistent.
> * The folder name of an email is now included as a field called 'folder' (e.g. folder=INBOX.Sent). This is quite handy in search/post-indexing processing
> * The addPartToDocument() method that processes attachments is significantly re-written, as there looked to be no real way the existing code would ever actually process attachment content and add it to the row data
> Tested on the 3.x trunk with a number of popular IMAP servers.
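
As a rough illustration of two rules from the quoted description (attachments are processed only if neither spelling of the attribute is false, and an email with no flags still gets the value "none"), here is a small sketch. The class and method names are hypothetical and this is not the MailEntityProcessor code itself.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper illustrating the attribute rules in the description.
class MailEntityOptions {
  // Both spellings default to true; attachments are skipped if either is false.
  static boolean shouldProcessAttachments(Map<String, String> entityAttrs) {
    return isTrueOrAbsent(entityAttrs.get("processAttachments"))
        && isTrueOrAbsent(entityAttrs.get("processAttachement"));
  }

  private static boolean isTrueOrAbsent(String value) {
    return value == null || Boolean.parseBoolean(value.trim());
  }

  // Returns the values for the 'flags' field; an email with no flags
  // still yields a single value, the string "none".
  static List<String> flagsFieldValues(List<String> flagNames) {
    if (flagNames == null || flagNames.isEmpty()) {
      return Collections.singletonList("none");
    }
    return new ArrayList<>(flagNames);
  }
}
```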


