You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org> on 2009/04/17 09:19:14 UTC

[jira] Created: (SOLR-1120) Simplify EntityProcessor API

Simplify EntityProcessor API
----------------------------

                 Key: SOLR-1120
                 URL: https://issues.apache.org/jira/browse/SOLR-1120
             Project: Solr
          Issue Type: Improvement
          Components: contrib - DataImportHandler
    Affects Versions: 1.3
            Reporter: Shalin Shekhar Mangar
            Assignee: Shalin Shekhar Mangar
             Fix For: 1.4


Writing an EntityProcessor is deceptively complex. There are so many gotchas.

I propose the following:
# Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
# Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
# EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1120) Simplify EntityProcessor API

Posted by "Fergus McMenemie (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700085#action_12700085 ] 

Fergus McMenemie commented on SOLR-1120:
----------------------------------------

Good idea. Dont know if you are interested in the following.

* Further to extracting out the Transformer application logic, I was wondering if every entity attribute read should automatically be processed by replaceTokens. Is there any ligitimate place where one would want to disallow replaceTokens? The following snippet of code is repeated far too many times; but is important if DIH is to provide simple predictable behaviour.  

{code}
    s = context.getEntityAttribute(CHANGELIST_OMIT);
    if (s != null) s = resolver.replaceTokens(s);

{code}

* The regexp transformer now has several combinations of mutually exclusive attributes. It would be nice to check the attributes for nonsensical combinations. However given that the transformer is invoked for every row such checking code could be a nasty overhead. I dont know how to sort this, but somehow we need to catch the first invocation of a fields transformer and allow far more detailed checking of the attributes

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul reopened SOLR-1120:
------------------------------


this has broken some features in XPathEntityProcessor

if a field is added by a transformer then XPathEntityProcessor is unable to use it in common fields .Features like $nextUrl , $hasMore cannot work

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul resolved SOLR-1120.
------------------------------

    Resolution: Fixed

committed revision:781272

thanks Steffen Baumgart

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1120:
----------------------------------------

    Attachment: SOLR-1120.patch

Another patch which adds a new method to Context:
# Context#getResolvedEntityAttribute(String) which returns the resolved value of the entity attribute

I'll commit shortly.

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-1120.
-----------------------------------------

    Resolution: Fixed

Committed revision 783750.

Thanks Noble!

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar resolved SOLR-1120.
-----------------------------------------

    Resolution: Fixed

Committed revision 766608.

Thanks Noble!

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

A new class added EntityProcessorWrapper which does all the transformer related actions. EntityProcessors are now agnostic of Transformers alltogether

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

futher simplifying DocBuilder

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1120:
----------------------------------------

    Attachment: SOLR-1120.patch

Thanks Noble.

This patch has the following changes on top of Noble's patch:
# Added EntityProcessor#close method which will be called at the end of import. This is simpler than using EventListeners.
# Fixed a bug with calling destroy in the wrong place in DataConfig

All tests pass. I'll commit this shortly.

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar reopened SOLR-1120:
-----------------------------------------


The initial commit for this issue broke the Debug functionality.

Refer to http://www.lucidimagination.com/search/document/42c345a606820f9/npe_in_dataimport_debuglogger_peekstack_dih_development_console

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703216#action_12703216 ] 

Shalin Shekhar Mangar commented on SOLR-1120:
---------------------------------------------

Committed revision 769058.

Thanks Noble!

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700087#action_12700087 ] 

Noble Paul commented on SOLR-1120:
----------------------------------

bq.Is there any ligitimate place where one would want to disallow replaceTokens?

yes . the XPathEntityProcessor uses it directly just to know what are the variables in the url so that it can read them and store . probably we can  add amethod getEntityAttributeResolved() to get the resolved value

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Noble Paul (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Noble Paul updated SOLR-1120:
-----------------------------

    Attachment: SOLR-1120.patch

a new method added to EntityProcessor (postTransform) . this can be used by the EntityProcessor implementations to get a callback after the transformations are done

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-1120) Simplify EntityProcessor API

Posted by "Shalin Shekhar Mangar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-1120:
----------------------------------------

    Attachment: SOLR-1120.patch

Same as Noble's patch but DocWrapper extends SolrInputDocument now.

I'll commit this shortly.

> Simplify EntityProcessor API
> ----------------------------
>
>                 Key: SOLR-1120
>                 URL: https://issues.apache.org/jira/browse/SOLR-1120
>             Project: Solr
>          Issue Type: Improvement
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.3
>            Reporter: Shalin Shekhar Mangar
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch, SOLR-1120.patch
>
>
> Writing an EntityProcessor is deceptively complex. There are so many gotchas.
> I propose the following:
> # Extract out the Transformer application logic from EntityProcessor and add it to DocBuilder. Then EntityProcessor do not need to call applyTransformer or know about rowIterator and getFromRowCache() methods.
> # Change the meaning of EntityProcessor#destroy to be called on end of parent's row -- Right now init is called once per parent row but destroy actually means the end of import. In fact, there is no correct way for an entity processor to do clean up right now. Most do clean up when returning null (end of data) but with the introduction of $skipDoc, a transformer can return $skipDoc and the entity processor will never get a chance to clean up for the current init.
> # EntityProcessor will use the EventListener API to listen for import end. This should be used by EntityProcessor to do a final cleanup.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.