Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/04/27 09:56:12 UTC

[jira] [Commented] (PIG-4810) Implement Merge join for spark engine

    [ https://issues.apache.org/jira/browse/PIG-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259738#comment-15259738 ] 

liyunzhang_intel commented on PIG-4810:
---------------------------------------

[~kexianda]: some comments:
1. Add joinOp.setIndexFile(strFile.getFileName()) in Spark mode, as is done in MR. The index file is later uploaded to the distributed cache (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.JoinDistributedCacheVisitor#visitMergeJoin) so that nodes in the cluster can access it more efficiently. I think we can later create more copies of the index file with FileSystem.setReplication(indexFile, 10) so that other nodes can access it more efficiently.
2. In TestMerge#testMergeJoinWithReplicatedJoin, there is no need to add an ORDER BY before the regular join (a regular join does not require its input to be sorted):
{code}
if (!Util.isSparkExecType(cluster.getExecType())) {
    pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
} else {
    // currently, the implementation of FRJoin can't guarantee the order in spark mode
    // the input for MergeJoin should be in asc order.
    pigServer.registerQuery("D0 = join A by f1, B by f1 using 'replicated';");
    pigServer.registerQuery("D = ORDER D0 BY A::f1 ASC;");
}
{code}
can be simplified to
{code}
pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
{code}
3. Fix the code formatting, such as indentation.

> Implement Merge join for spark engine
> -------------------------------------
>
>                 Key: PIG-4810
>                 URL: https://issues.apache.org/jira/browse/PIG-4810
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4810-2.patch, PIG-4810-3.patch, PIG-4810-4.patch, PIG-4810-5.patch, PIG-4810.patch
>
>
> In current code base(a9151ac), we use regular join to implement merge join in spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)