Posted to dev@lucene.apache.org by "Mikhail Khludnev (JIRA)" <ji...@apache.org> on 2013/07/10 23:39:48 UTC

[jira] [Comment Edited] (SOLR-4799) SQLEntityProcessor for zipper join

    [ https://issues.apache.org/jira/browse/SOLR-4799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13705099#comment-13705099 ] 

Mikhail Khludnev edited comment on SOLR-4799 at 7/10/13 9:38 PM:
-----------------------------------------------------------------

Attaching the first drop. 

I can't say I fully share your idea, [~jdyer], of adding the zipper ability across all processors, but let's see how it would look.

The implementation itself is not a big deal because it's based on Guava; it's enabled by {{join='zipper'}}. Note: it doesn't support the {{People \*-> Country}} case, only the classic {{People -\*> Sports}}, though a one-liner covers that.

I extracted the DIHSupport constructor, which parses attributes into a Relation class. I introduced Zipper as an internal EP strategy, like DIHCacheSupport. It seems all of this should be extracted into a few proper strategies in the future.

The Derby test covers only sports, not countries. Countries can be covered too, but not both at once: joining both sides by zipper would make the test very puzzling. So that needs to be addressed later.

What worries me most is the test data. From what I see, we have only vanilla data: every person has one or a few sports. The zipper caveats are orphaned sports and sportless people. If there is a bug in the zipper, it can corrupt the entities that follow. By the way, from my experience in the DIH vs. threads battle, I can say the same threat applies to the caching implementations. Ideally, I'd like to pause this one, improve the Derby test to cover orphaned children and childless parents, and then continue with the zipper.
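
To make those two caveats concrete, here is a minimal sketch of a zipper join in plain JDK Java. It is not the attached patch and does not use Guava or any DIH class; the names ZipperJoinSketch, Child and zip are hypothetical, and the rule that orphaned children are silently skipped is an assumption made for illustration. Both inputs are assumed to be sorted ascending by the join key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ZipperJoinSketch {

    /** Hypothetical child row: the parent_id foreign key plus a payload. */
    static final class Child {
        final int parentId;
        final String name;

        Child(int parentId, String name) {
            this.parentId = parentId;
            this.name = name;
        }
    }

    /**
     * Joins parents (sorted ascending by id) with children (sorted ascending
     * by parentId) in a single pass, holding only one child row of lookahead.
     * Assumption: orphaned children are skipped; childless parents get an empty list.
     */
    static Map<Integer, List<String>> zip(List<Integer> parentIds, List<Child> children) {
        Map<Integer, List<String>> joined = new LinkedHashMap<>();
        Iterator<Child> it = children.iterator();
        Child pending = it.hasNext() ? it.next() : null; // one-row lookahead

        for (int parentId : parentIds) {
            // Orphans: child keys below the current parent have no match; drop them.
            while (pending != null && pending.parentId < parentId) {
                pending = it.hasNext() ? it.next() : null;
            }
            // Collect every child whose key equals the current parent's key.
            List<String> matched = new ArrayList<>();
            while (pending != null && pending.parentId == parentId) {
                matched.add(pending.name);
                pending = it.hasNext() ? it.next() : null;
            }
            // A childless parent still produces an entry, with an empty list.
            joined.put(parentId, matched);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Integer> people = Arrays.asList(1, 2, 4);
        List<Child> sports = Arrays.asList(
                new Child(1, "tennis"),
                new Child(1, "golf"),
                new Child(3, "chess"),   // orphan: there is no person 3
                new Child(4, "rowing"));
        // prints {1=[tennis, golf], 2=[], 4=[rowing]}
        System.out.println(zip(people, sports));
    }
}
```

Person 2 is the sportless parent and "chess" is the orphaned sport. If the orphan-skipping loop were missing, the lookahead would stay stuck on parent_id 3 and person 4 would lose its rows, which is the kind of misalignment of following entities that the test data should be able to expose.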

Please let me know what you think!  
                
                  
> SQLEntityProcessor for zipper join
> ----------------------------------
>
>                 Key: SOLR-4799
>                 URL: https://issues.apache.org/jira/browse/SOLR-4799
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>            Reporter: Mikhail Khludnev
>            Priority: Minor
>              Labels: dih
>         Attachments: SOLR-4085.patch
>
>
> DIH is mostly considered a playground tool, and real-world usages end up with SolrJ. I want to contribute a few improvements targeting DIH performance.
> This one provides a performant approach to joining SQL entities with a minimal memory footprint, in contrast to http://wiki.apache.org/solr/DataImportHandler#CachedSqlEntityProcessor
> The idea is:
> * the parent table is explicitly ordered by its PK in SQL
> * the children table is explicitly ordered by the parent_id FK in SQL
> * the children entity processor joins the ordered result sets using the ‘zipper’ algorithm.
> Do you think it’s worth contributing this to DIH?
> cc: [~goksron] [~jdyer]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org