Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2017/09/19 22:10:00 UTC

[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

    [ https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16172428#comment-16172428 ] 

Rohini Palaniswamy commented on PIG-4120:
-----------------------------------------

Comments:
   - This needs to be implemented for POMergeCoGroup as well.


POMergeJoin.java:
  1) Remove protected transient TupleFactory mTupleFactory. mTupleFactory already comes from PhysicalOperator.
  2) copy method
     - LRs = copy.LRs; -> this.LRs = copy.LRs;
     - It should copy every non-transient field. Currently it is missing signature, rightInputFileName, etc.
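
To illustrate the copy() point, here is a minimal sketch of the shape I have in mind. MergeJoinSketch is a hypothetical stand-in, not the real POMergeJoin; only the field names LRs, signature and rightInputFileName come from the patch, and their types here are assumptions.

```java
// Hypothetical stand-in for an operator with a copy method; not the real POMergeJoin.
class MergeJoinSketch {
    String[] LRs;                    // assumed type, for illustration only
    String signature;
    String rightInputFileName;
    transient Object runtimeState;   // transient state is rebuilt at runtime, not copied

    MergeJoinSketch(String[] LRs, String signature, String rightInputFileName) {
        this.LRs = LRs;
        this.signature = signature;
        this.rightInputFileName = rightInputFileName;
    }

    // Copies every non-transient field, each qualified with "this." for consistency.
    MergeJoinSketch(MergeJoinSketch copy) {
        this.LRs = copy.LRs;
        this.signature = copy.signature;
        this.rightInputFileName = copy.rightInputFileName;
        // runtimeState is intentionally left unset.
    }
}
```

With every non-transient field carried over in one place, callers should not need to patch up the copy afterwards.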

POMergeJoinTez.java:
  1) List<LogicalInput> logicalInputs is not used anywhere and can be removed.
  2) List<KeyValueReader> keyValueReaders - You only need one KeyValueReader.
  3) Get rid of the getName() function and build on super.name() instead. Currently it is missing the case of sparse join.
{code}
@Override
public String name() {
    return super.name().replace("MergeJoin", "MergeJoinTez") + "\t<-\t " + this.inputKey;
}
{code}
  4) Cast to UnorderedKVReader is not required.
  5) Tuple copy code can be shorter:
{code}
while (reader.next()) {
    Tuple origTuple = (Tuple) reader.getCurrentValue();
    Tuple copy = mTupleFactory.newTuple(origTuple.getAll());
    index.add(copy);
}
{code}
  6) Creating another copy of index is unnecessary:
{code}
LinkedList<Tuple> indexList = new LinkedList<Tuple>(index);
{code}
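
Points 5) and 6) together could look roughly like the sketch below: fill the final list directly while reading, so no second copy of the index is needed. This is only an illustration under assumptions: ValueReader is a hypothetical interface modeled on the next()/getCurrentValue() calls above, List<Object> stands in for Tuple, and readerOver() is a made-up helper for trying it out. None of this is the Tez or Pig API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

class IndexLoadSketch {
    // Hypothetical minimal reader, modeled on the next()/getCurrentValue() calls above.
    interface ValueReader {
        boolean next();
        Object getCurrentValue();
    }

    // Builds the index directly in the list that will be used later,
    // copying each row once and never copying the whole list again.
    static LinkedList<List<Object>> loadIndex(ValueReader reader) {
        LinkedList<List<Object>> index = new LinkedList<>();
        while (reader.next()) {
            @SuppressWarnings("unchecked")
            List<Object> orig = (List<Object>) reader.getCurrentValue();
            index.add(new ArrayList<>(orig)); // shallow copy of the row's fields
        }
        return index;
    }

    // Made-up helper: wraps in-memory rows in a ValueReader for experimentation.
    static ValueReader readerOver(List<List<Object>> rows) {
        Iterator<List<Object>> it = rows.iterator();
        return new ValueReader() {
            private List<Object> current;

            @Override
            public boolean next() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }

            @Override
            public Object getCurrentValue() {
                return current;
            }
        };
    }
}
```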

TezCompiler.java:
1) Style: we keep else on the same line as the closing brace of the if block.
2) joinOp.setupRightPipeline(rightPipelinePlan); and joinOp.setSignature(rightLoader.getSignature()); are not required if copy() is fixed.

DefaultIndexableLoader.java:
1) Can you rename setIndex() to loadIndex()?

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> ------------------------------------------------------------------
>
>                 Key: PIG-4120
>                 URL: https://issues.apache.org/jira/browse/PIG-4120
>             Project: Pig
>          Issue Type: Sub-task
>          Components: tez
>            Reporter: Rohini Palaniswamy
>            Assignee: Satish Subhashrao Saley
>             Fix For: 0.18.0
>
>         Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates the index file in HDFS and the second DAG does the merge join.  Similar to replicate join, we can broadcast the index file, cache it, and use it in merge join and merge cogroup. This will give better performance and also eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)