Posted to dev@lucene.apache.org by "Peter Sturge (JIRA)" <ji...@apache.org> on 2010/11/19 00:43:13 UTC

[jira] Created: (SOLR-2245) MailEntityProcessor Update

MailEntityProcessor Update
--------------------------

                 Key: SOLR-2245
                 URL: https://issues.apache.org/jira/browse/SOLR-2245
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 1.4.1, 1.4
            Reporter: Peter Sturge
            Priority: Minor
             Fix For: 1.4.2
         Attachments: SOLR-2245.patch

This patch addresses a number of issues in the MailEntityProcessor contrib-extras module.

The changes are outlined here:
* Added an 'includeContent' entity attribute, so that message content can be included independently of attachment processing
     e.g. <entity includeContent="true" processAttachments="false" . . . /> would include message content, but not attachment content
* Added a 'processAttachments' attribute as a synonym for the mis-spelled (and singular) 'processAttachement' property; the two behave identically. Default = 'true'. If either is false, attachments are not processed. Only one of the two should be specified in a given <entity> tag.
* Added a FLAGS.NONE value, so that if an email has no flags (i.e. it is unread, not deleted, etc.), a value is still stored in the 'flags' field (the string "none")
Note: there is a potential backward-compatibility issue with FLAGS.NONE for clients that expect the absence of the 'flags' field to mean 'Not read'. I expect this to be extremely rare, and relying on it is inadvisable in any case because user flags can be set arbitrarily, so fixing it now will keep future client access consistent.
* The folder name of an email is now included as a field called 'folder' (e.g. folder=INBOX.Sent). This is quite handy in search/post-indexing processing
* The addPartToDocument() method that processes attachments has been significantly rewritten, as there appeared to be no way the existing code would ever actually process attachment content and add it to the row data
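
The attribute and flag rules in the list above can be sketched roughly as follows. This is only an illustration of the described behaviour, not code from the patch: the class and method names (MailAttrSketch, boolAttr, shouldProcessAttachments, flagsFieldValues) are hypothetical, and the entity attributes are assumed to arrive as a plain string map.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper; names are illustrative and not taken from the patch.
final class MailAttrSketch {

    // Reads a boolean entity attribute, falling back to a default when absent.
    static boolean boolAttr(Map<String, String> attrs, String name, boolean dflt) {
        String value = attrs.get(name);
        return value == null ? dflt : Boolean.parseBoolean(value.trim());
    }

    // Both spellings default to true; attachments are skipped if either is false.
    static boolean shouldProcessAttachments(Map<String, String> attrs) {
        return boolAttr(attrs, "processAttachments", true)
            && boolAttr(attrs, "processAttachement", true);
    }

    // A message without flags still yields a value for the 'flags' field: "none".
    static List<String> flagsFieldValues(List<String> flags) {
        return flags.isEmpty() ? List.of("none") : List.copyOf(flags);
    }
}
```

With this shape, a client can always query the 'flags' field instead of having to test for its absence.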

Tested on the 3.x trunk with a number of popular IMAP servers.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935898#action_12935898 ] 

Peter Sturge commented on SOLR-2245:
------------------------------------

Forgot to mention...
Because this now supports delta-import commands, the 'deltaFetch' attribute is no longer needed and is not used.




[jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934800#action_12934800 ] 

Peter Sturge commented on SOLR-2245:
------------------------------------

This latest version of the updated MailEntityProcessor adds a few new features:

1. Incorporated SOLR-1958 (exception if fetchMailsSince isn't specified) into this patch
2. Added a hacky version of delta mail retrieval for scheduled import runs:
       The new property is called 'deltaFetch'. If 'true', the first time the import is run, it will read the 'fetchMailsSince' property and import as normal
       On subsequent runs (within the same process session), the import will only fetch mail since the last run.
       Because it uses a runtime system property to hold the last_index_time, and there is currently no persistence, if/when the server is restarted, the last_index_time is not saved and the original fetchMailsSince value is used.
       I couldn't find exposed APIs for the dataimport.properties file (all the methods are private or pkg protected), so persistence is not included in this patch version
3. Added support for including shared folders in the import
4. Added support for including personal folders (other folders) in the import

A typical <entity> element in data-config.xml might look something like this:

{code:xml}
    <entity name="email"
      user="user@mydomain.com"
      password="userpwd"
      host="imap.mydomain.com"
      fetchMailsSince="2010-08-01 00:00:00"
      deltaFetch="true"
      include=""
      exclude=""
      recurse="false"
      folders="INBOX,Inbox,inbox"
      includeContent="true"
      processAttachments="true"
      includeOtherUserFolders="true"
      includeSharedFolders="true"
      batchSize="100"
      processor="MailEntityProcessor"
      protocol="imap"/>
{code}




[jira] Updated: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Sturge updated SOLR-2245:
-------------------------------

    Attachment: SOLR-2245.patch



[jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936137#action_12936137 ] 

Lance Norskog commented on SOLR-2245:
-------------------------------------

bq. Tested on the 3.x trunk ...
3.x or trunk or both? Just 3.x is fine.

Please add 'in-reply-to' to the fetched & stored headers. This is necessary to reconstruct mail threads.
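
To illustrate why the header matters for threading, here is a minimal sketch of how a client might walk In-Reply-To links back to the first message of a thread. It assumes the Message-ID and In-Reply-To values have already been indexed and loaded into a map; the class and method names (ThreadRootSketch, threadRoot) are hypothetical and not part of the patch or of Solr.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical client-side helper; not part of MailEntityProcessor.
final class ThreadRootSketch {

    // 'inReplyTo' maps a Message-ID to the Message-ID it replies to.
    // Messages that start a thread have no entry in the map.
    static String threadRoot(String messageId, Map<String, String> inReplyTo) {
        Set<String> seen = new HashSet<>();
        String current = messageId;
        // Stop at a message with no parent, or when malformed headers form a cycle.
        while (seen.add(current) && inReplyTo.containsKey(current)) {
            current = inReplyTo.get(current);
        }
        return current;
    }
}
```

Messages that resolve to the same root belong to the same thread. Without the stored 'in-reply-to' value there is nothing to follow.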

Also, please add or update the unit tests.





[jira] Updated: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Sturge updated SOLR-2245:
-------------------------------

    Attachment: SOLR-2245.patch



[jira] Issue Comment Edited: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934800#action_12934800 ] 

Peter Sturge edited comment on SOLR-2245 at 11/23/10 5:58 AM:
--------------------------------------------------------------

This latest version of the updated MailEntityProcessor adds a few new features:

1. Incorporated SOLR-1958 (exception if fetchMailsSince isn't specified) into this patch
2. Added a hacky version of delta mail retrieval for scheduled import runs:
       The new property is called 'deltaFetch'. If 'true', the first time the import is run, it will read the 'fetchMailsSince' property and import as normal
       On subsequent runs (within the same process session), the import will only fetch mail since the last run.
       Because it uses a runtime system property to hold the last_index_time, and there is currently no persistence, if/when the server is restarted, the last_index_time is not saved and the original fetchMailsSince value is used.
       I couldn't find exposed APIs for the dataimport.properties file (all the methods are private or pkg protected), so persistence is not included in this patch version
3. Added support for including shared folders in the import
4. Added support for including personal folders (other folders) in the import

A typical <entity> element in data-config.xml might look something like this:
{code:xml}
    <entity name="email"
      user="user@mydomain.com" 
      password="userpwd" 
      host="imap.mydomain.com" 
      fetchMailsSince="2010-08-01 00:00:00" 
      deltaFetch="true"
      include=""
      exclude=""
      recurse="false"
      folders="INBOX,Inbox,inbox"
      includeContent="true"
      processAttachments="true"
      includeOtherUserFolders="true"
      includeSharedFolders="true"
      batchSize="100"
      processor="MailEntityProcessor"
      protocol="imap"/>
{code} 




[jira] Updated: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Sturge updated SOLR-2245:
-------------------------------

    Attachment: SOLR-2245.zip

This patch update implements delta-import more properly, replacing the kludge used in the previous version.
MailEntityProcessor with this patch is useful for importing emails 'en-masse' the first time 'round, then only new mails after that.

Behaviour:
* If you send a full-import command, then the 'fetchMailsSince' property specified in data-config.xml will always be used.
* If you send a delta-import command, the 'fetchMailsSince' property specified in data-config.xml is used for the first call only. 
  Subsequent delta-import commands will use the time since the last index update.
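
As a rough sketch of the rule above (not the actual code in the attachment), the choice of cut-off time could look like this. The class and method names (FetchSinceSketch, sinceFor) are hypothetical, and the sketch assumes that every run, full or delta, records its time as the last index update.

```java
import java.time.Instant;

// Hypothetical sketch of the described behaviour; not taken from the patch.
final class FetchSinceSketch {
    private final Instant fetchMailsSince; // value configured in data-config.xml
    private Instant lastIndexTime;         // held in memory only, lost on restart

    FetchSinceSketch(Instant fetchMailsSince) {
        this.fetchMailsSince = fetchMailsSince;
    }

    // Returns the cut-off for this run, then records 'now' as the last index time.
    Instant sinceFor(String command, Instant now) {
        boolean delta = "delta-import".equals(command);
        // full-import always uses the configured value; delta-import does so
        // only while no earlier run has been recorded.
        Instant since = (delta && lastIndexTime != null) ? lastIndexTime : fetchMailsSince;
        lastIndexTime = now;
        return since;
    }
}
```

Because lastIndexTime lives only in the object, a restart falls back to the configured fetchMailsSince value.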

The code changes in this version are significant enough that I've included the complete MailEntityProcessor source as well as a patch file.

This version doesn't use the persistent last_index_time functionality of dataimport.properties (i.e. it's delta only for the life of the Solr process). If I get some free cycles, I'll try to put this in.




Re: [jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by Bill Bell <bi...@gmail.com>.
3.1 may be too late

Bill Bell
Sent from mobile


On Feb 15, 2011, at 8:52 AM, "Peter Sturge (JIRA)" <ji...@apache.org> wrote:

> 
>    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994847#comment-12994847 ] 
> 
> Peter Sturge commented on SOLR-2245:
> ------------------------------------
> 
> I've been meaning to get back to this, as I have made some local updates to this that help performance.
> Could you give me some feedback on these 2 questions please - it would be really useful:
>  * Is there a "committer's standard" or similar spec that describes what tests should be included, and if so, could you point me to it please?
>      I can then make sure I include appropriate tests
>  * Is there a time-frame for committing for this or next release?
>      I have a product release of my own coming fup or beg-March, so if I know the time-scales, I can plan accordingly.
> 
> Thanks!
> Peter
> 
> 
>> MailEntityProcessor Update
>> --------------------------
>> 
>>                Key: SOLR-2245
>>                URL: https://issues.apache.org/jira/browse/SOLR-2245
>>            Project: Solr
>>         Issue Type: Improvement
>>         Components: contrib - DataImportHandler
>>   Affects Versions: 1.4, 1.4.1
>>           Reporter: Peter Sturge
>>           Priority: Minor
>>            Fix For: 1.4.2
>> 
>>        Attachments: SOLR-2245.patch, SOLR-2245.patch, SOLR-2245.zip
>> 
>> 
>> This patch addresses a number of issues in the MailEntityProcessor contrib-extras module.
>> The changes are outlined here:
>> * Added an 'includeContent' entity attribute to allow specifying content to be included independently of processing attachments
>>     e.g. <entity includeContent="true" processAttachments="false" . . . /> would include message content, but not attachment content
>> * Added a synonym called 'processAttachments', which is synonymous to the mis-spelled (and singular) 'processAttachement' property. This property functions the same as processAttachement. Default= 'true' - if either is false, then attachments are not processed. Note that only one of these should really be specified in a given <entity> tag.
>> * Added a FLAGS.NONE value, so that if an email has no flags (i.e. it is unread, not deleted etc.), there is still a property value stored in the 'flags' field (the value is the string "none")
>> Note: there is a potential backward compat issue with FLAGS.NONE for clients that expect the absence of the 'flags' field to mean 'Not read'. I'm calculating this would be extremely rare, and is inadviasable in any case as user flags can be arbitrarily set, so fixing it up now will ensure future client access will be consistent.
>> * The folder name of an email is now included as a field called 'folder' (e.g. folder=INBOX.Sent). This is quite handy in search/post-indexing processing
>> * The addPartToDocument() method that processes attachments is significantly re-written, as there looked to be no real way the existing code would ever actually process attachment content and add it to the row data
>> Tested on the 3.x trunk with a number of popular imap servers.
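
For readers following along, the attribute and flag rules described above can be sketched roughly as follows. This is only an illustration, not the code in SOLR-2245.patch: the class name MailEntityOptionsSketch and the helper methods attr(), processAttachments() and flagValues() are hypothetical, and the entity attributes are modelled as a plain Map rather than the real DataImportHandler context.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the actual patch.
class MailEntityOptionsSketch {

    // Reads a boolean entity attribute, using the default when it is absent.
    static boolean attr(Map<String, String> attrs, String name, boolean dflt) {
        String value = attrs.get(name);
        return value == null ? dflt : Boolean.parseBoolean(value.trim());
    }

    // Both spellings default to true; attachments are skipped if either is false.
    static boolean processAttachments(Map<String, String> attrs) {
        return attr(attrs, "processAttachments", true)
            && attr(attrs, "processAttachement", true);
    }

    // A message with no flags still yields a value: the string "none".
    static List<String> flagValues(List<String> flagsOnMessage) {
        if (flagsOnMessage.isEmpty()) {
            return List.of("none");
        }
        return new ArrayList<>(flagsOnMessage);
    }

    public static void main(String[] args) {
        Map<String, String> entity =
            Map.of("includeContent", "true", "processAttachments", "false");
        System.out.println(processAttachments(entity)); // false
        System.out.println(flagValues(List.of()));      // [none]
    }
}
```

With this shape, a client can always query the 'flags' field and never has to treat a missing field as "not read".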

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by "Peter Sturge (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994847#comment-12994847 ] 

Peter Sturge commented on SOLR-2245:
------------------------------------

I've been meaning to get back to this, as I have made some local updates to this that help performance.
Could you give me some feedback on these 2 questions please - it would be really useful:
  * Is there a "committer's standard" or similar spec that describes what tests should be included, and if so, could you point me to it please?
      I can then make sure I include appropriate tests
  * Is there a time-frame for committing for this or next release?
      I have a product release of my own coming up around the beginning of March, so if I know the time-scales, I can plan accordingly.

Thanks!
Peter





[jira] Commented: (SOLR-2245) MailEntityProcessor Update

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994797#comment-12994797 ] 

Yonik Seeley commented on SOLR-2245:
------------------------------------

Thanks Peter,
If we can get someone who knows more DIH stuff to add some tests, we can get this committed!

