Posted to dev@pig.apache.org by "liyunzhang_intel (JIRA)" <ji...@apache.org> on 2016/04/27 09:56:12 UTC
[jira] [Commented] (PIG-4810) Implement Merge join for spark engine
[ https://issues.apache.org/jira/browse/PIG-4810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259738#comment-15259738 ]
liyunzhang_intel commented on PIG-4810:
---------------------------------------
[~kexianda]: some comments:
1. Add joinOp.setIndexFile(strFile.getFileName()) in Spark mode, as is done in MR mode. The index file is later uploaded to the distributed cache (org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.JoinDistributedCacheVisitor#visitMergeJoin) so that nodes in the cluster can access it more efficiently. I think we can later create more copies of the index file with FileSystem.setReplication(indexFile, 10) so that other nodes can read it more efficiently.
2. For TestMerge#testMergeJoinWithReplicatedJoin, there is no need to add an ORDER BY before the regular join (a regular join does not require its input to be sorted). This code:
{code}
if (!Util.isSparkExecType(cluster.getExecType())) {
    pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
} else {
    // currently, the implementation of FRJoin can't guarantee the order in spark mode
    // the input for MergeJoin should be in asc order.
    pigServer.registerQuery("D0 = join A by f1, B by f1 using 'replicated';");
    pigServer.registerQuery("D = ORDER D0 BY A::f1 ASC;");
}
{code}
can be simplified to:
{code}
pigServer.registerQuery("D = join A by f1, B by f1 using 'replicated';");
{code}
3. Please fix the code formatting, such as indentation.
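
To illustrate the point in comment 2 about sorted input, here is a minimal sketch of a sort-merge join over plain int keys. It is only an illustration, not the code in the patch or in Pig: the class name MergeJoinSketch and the method mergeJoin are hypothetical. It shows why a merge join needs both inputs in ascending order, whereas a regular (hash or shuffle based) join does not.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not Pig's or Spark's implementation.
final class MergeJoinSketch {

    /**
     * Inner-joins two key arrays that are assumed to be sorted ascending.
     * Returns one {leftKey, rightKey} pair per matching combination.
     */
    static List<int[]> mergeJoin(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                      // left key too small, advance left
            } else if (left[i] > right[j]) {
                j++;                      // right key too small, advance right
            } else {
                int key = left[i];
                int runStart = j;         // start of the equal-key run on the right
                // Emit the cross product of the two equal-key runs.
                while (i < left.length && left[i] == key) {
                    for (j = runStart; j < right.length && right[j] == key; j++) {
                        out.add(new int[] {left[i], right[j]});
                    }
                    i++;
                }
                // j now points just past the right-hand run for this key.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] other = {2, 3, 5};
        // Sorted input: all matches are found.
        System.out.println(mergeJoin(new int[] {1, 2, 2, 5}, other).size()); // prints 3
        // Same keys, unsorted: the single forward pass misses matches.
        System.out.println(mergeJoin(new int[] {5, 2, 1, 2}, other).size()); // prints 1
    }
}
```

With unsorted input the single forward pass skips keys it has already moved past, so the sort requirement applies to the input of the merge join itself, not to the input of the regular join that precedes it.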
> Implement Merge join for spark engine
> -------------------------------------
>
> Key: PIG-4810
> URL: https://issues.apache.org/jira/browse/PIG-4810
> Project: Pig
> Issue Type: Sub-task
> Components: spark
> Reporter: liyunzhang_intel
> Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4810-2.patch, PIG-4810-3.patch, PIG-4810-4.patch, PIG-4810-5.patch, PIG-4810.patch
>
>
> In the current code base (a9151ac), we use a regular join to implement merge join in Spark mode.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)