You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by kelly zhang <li...@intel.com> on 2016/07/07 07:31:22 UTC

Re: Review Request 45667: Support Pig On Spark


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestCombiner.java, lines 189-191
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323873#file1323873line189>
> >
> >     Can be removed with the change to checkCombinerUsed

can not change checkCombinerUsed(pigServer, "c", true) to assertTrue(baos.toString().matches("(?si).*combine plan.*")). If we use original code, this test will fail in spark mode because there is combine plan in spark mode. In spark mode, there is no combine plan but POReduceBySpark implement the combine function.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestCombiner.java, lines 239-241
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323873#file1323873line239>
> >
> >     Can be removed with the change to checkCombinerUsed

can not change checkCombinerUsed(pigServer, "c", true) to assertTrue(baos.toString().matches("(?si).*combine plan.*")). If we use original code, this test will fail in spark mode because there is combine plan in spark mode. In spark mode, there is no combine plan but POReduceBySpark implement the combine function.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java, line 313
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323847#file1323847line313>
> >
> >     Just change the code to use UDFContext.getUDFContext().getJobConf() which should not be null instead of getClientSystemProps(). Not sure why it is using getClientSystemProps() in the first place.

Here if we change to UDFContext.getUDFContext().getJobConf(), problem still exists.


The reason why verify  UDFContext.getUDFContext().getJobConf() or not is because spark executor first initializes all the object then UDFContext.deserialize is called, HBaseStorage constructor is called before UDFContext.deserialized(), so here we need to verify  UDFContext.getUDFContext().getJobConf() is null or not otherwise NPE will be thrown out here.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/util/AccumulatorOptimizerUtil.java, line 294
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323844#file1323844line294>
> >
> >     The utility class is for common code between mapreduce, tez and spark. Please move all Spark specific code to the spark AccumulatorOptimizer

OK, i will move AccumulatorOptimizerUtil#addAccumulatorSpark to spark AccumulatorOptimizer but need make AccumulatorOptimizerUtil#check public because this will be used in spark AccumulatorOptimizer.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/impl/PigContext.java, line 924
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323849#file1323849line924>
> >
> >     This can be reverted. PigContext need not be serialized to the backend. See PIG-4866

PIG-4866 is not serialize pigcontext in configuration while here we override PigContext#writeObject and PigContext#readObject to only serialize and deserialize 1 attribute(packageImportList)  in spark mode.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestPigRunner.java, lines 333-334
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323890#file1323890line333>
> >
> >     1 job for 1 POStore can impact performance. It is not exactly multi-query. Is there a jira to optimize this better and use 1 job?

Yes, 1 job for 1 POStore can impact performance. PIG-4863 will implement it.  I have added a TODO in the comment of code:
  //TODO: 1 job for 1 POStore can impact performance. PIG-4863 will fix this problem
 assertTrue(stats.getJobGraph().size() == 2);


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestStoreBase.java, line 155
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323899#file1323899line155>
> >
> >     Second store should not succeed.

Yes, the second store should not succeed because of the first one fails.

I will change the comment from
 A = load xx;
             store A into '1.out' using DummyStore('true','1');   -- first job should fail
             store A into '2.out' using DummyStore('false','1');  -- second job should success
             After multiQuery optimization the spark plan will be:
           Split - scope-14
            |   |
            |   a: Store(hdfs://1.out:myudfs.DummyStore('true','1')) - scope-4
            |   |
            |   a: Store(hdfs://2.out:myudfs.DummyStore('false','1')) - scope-7
            |
            |---a: Load(hdfs://zly2.sh.intel.com:8020/user/root/multiStore.txt:org.apache.pig.builtin.PigStorage) - scope-0------
            In current code base, once the first job fails, the second job will not be executed.
            the FILE_SETUPJOB_CALLED of second job will not exist.
            I explain more detail in PIG-4243

to
          
A = load xx;
             store A into '1.out' using DummyStore('true','1');   -- first store fails
             store A into '2.out' using DummyStore('false','1');  -- second store will not be executed after first store fails
             After multiQuery optimization the spark plan will be:
           Split - scope-14
            |   |
            |   a: Store(hdfs://1.out:myudfs.DummyStore('true','1')) - scope-4
            |   |
            |   a: Store(hdfs://2.out:myudfs.DummyStore('false','1')) - scope-7
            |
            |---a: Load(hdfs://zly2.sh.intel.com:8020/user/root/multiStore.txt:org.apache.pig.builtin.PigStorage) - scope-0------
            In current code base, once the first job fails, the second job will not be executed.
            the FILE_SETUPJOB_CALLED of second job will not exist.
            I explain more detail in PIG-4243


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestStoreBase.java, line 165
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323899#file1323899line165>
> >
> >     Setup job file of second one should also be there. Same with setup task as well. They are supposed to happen before putNext of both the stores is called.

In mr mode, one MapReduceOper is related with one job. If the first store in the job fails, second store will not be executed. But 
FILE_SETUPJOB_CALLED, FILE_SETUPJOB_CALLED of first and second job are both exists because setupTask() and setupJob() is  happened before putNext() of first store.
 In curent implementation of spark mode, this SparkOperator related with two stores. One store is related to one job so there are two jobs. Once one job fails the other jobs  will not be executed.
 the FILE_SETUPJOB_CALLED, FILE_SETUPTASK_CALLED of second job will not exist because setupJob and setupTask of second job is after putNext of first store.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestMapSideCogroup.java, line 340
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323883#file1323883line340>
> >
> >     Why is output ordering different with merge join? The assumption with merge join is that files are sorted and join is done based on that. So ordering should be same.

The script of this unit test is like following:

REGISTER myudfs.jar;
A = load './multiSplits1.txt,./multiSplits3.txt' using myudfs.DummyCollectableLoader() as (c1:chararray,c2:int);
B = load './multiSplits2.txt' using myudfs.DummyIndexableLoader() as (c1:chararray,c2:int);
C = cogroup A by c1, B by c1 using 'merge';
store C into './multiSplits.out';

 cat bin/multiSplits1.txt ?
1     1
1     2
1     3
2     1
2     2
2     3
3     1
3     2
3     3
cat bin/multiSplits3.txt ?
4     1
4     2
4     3
5     1
5     2
5     3
6     1
6     2
6     3
7     1
7     2
7     3
8     1
8     2
8     3
9     1
9     2
9     3
 cat bin/multiSplits2.txt 
3	1
3	2
3	3
4	1
4	2
4	3
5	1
5	2
5	3

It is interesting that there are different behavior in "load './multiSplits1.txt,./multiSplits3.txt'" in spark and mr mode.
In mr, it load multiple files according to the size of splits(bigger split go first). it first load multiSplits3.txt then multiSplits1.txt.  So the result of the script in mr will be:


4,{(4,1),(4,2),(4,3)},{(4,1),(4,2),(4,3)}
5,{(5,2),(5,1),(5,3)},{(5,1),(5,2),(5,3)}
6,{(6,1),(6,2),(6,3)},{}
7,{(7,1),(7,2),(7,3)},{}
8,{(8,1),(8,2),(8,3)},{}
9,{(9,1),(9,2),(9,3)},{}
1,{(1,1),(1,2),(1,3)},{}
2,{(2,1),(2,2),(2,3)},{}
3,{(3,3),(3,2),(3,1)},{(3,1),(3,2),(3,3)}

In spark, it loads multiple files in sequence. So the result of the script in spark will be:

1,{(1,1),(1,2),(1,3)},{}
2,{(2,1),(2,2),(2,3)},{}
3,{(3,3),(3,2),(3,1)},{(3,1),(3,2),(3,3)}
4,{(4,1),(4,2),(4,3)},{(4,1),(4,2),(4,3)}
5,{(5,2),(5,1),(5,3)},{(5,1),(5,2),(5,3)}
6,{(6,1),(6,2),(6,3)},{}
7,{(7,1),(7,2),(7,3)},{}
8,{(8,1),(8,2),(8,3)},{}
9,{(9,1),(9,2),(9,3)},{}

The reason why there are difference when load multiple splits in mr and spark mode is because org.apache.hadoop.mapreduce.JobSubmitter#writeNewSplits  sort the splits by the size(bigger split first) in mr mode.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > build.xml, line 366
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323774#file1323774line366>
> >
> >     Can you confirm that with all the spark changes running with hadoop 1.x (ant test -Dhadoopversion=20) is good?

Pig will drop  hadoop 1 support since 0.17(PIG-4923), so should we support hadoop 1.x in pig on spark?


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java, line 59
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323846#file1323846line59>
> >
> >     It is not a good idea to use a static variable to keep state in this static class. Multi-threaded use of Pig will have issues if different modes are used.
> >     
> >     This utility class is for code common to mapreduce, tez and spark. Please move the spark specific code to spark's SecondaryKeyOptimizer. This is also necessary if we ever complete the mavenization and move tez and spark specific code to separate modules.

OK, i will move specific code to spark's SecondaryKeyOptimizer, but i need make following function static to be reuse in spark's SecondaryKeyOptimizer:
org.apache.pig.backend.hadoop.executionengine.util.SecondaryKeyOptimizerUtil#getSortKeyInfo
org.apache.pig.backend.hadoop.executionengine.util.SecondaryKeyOptimizerUtil#setSecondaryPlan
org.apache.pig.backend.hadoop.executionengine.util.SecondaryKeyOptimizerUtil#collectColumnChain


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestEvalPipelineLocal.java, lines 1135-1136
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323877#file1323877line1135>
> >
> >     Code should be taking care of this.

Have moved the code of reset UDFContext.getUDFContext().addJonConf(null) to SparkLauncher.java


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestMultiQuery.java, line 116
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323886#file1323886line116>
> >
> >     Why are we using checkQueryOutputsAfterSortRecursive in many places when checkQueryOutputsAfterSort would do? It will unnecessarily increase the test execution time. Can they all be changed? Or am I missing something and checkQueryOutputsAfterSort cannot be used for some reason?
> 
> kelly zhang wrote:
>     The difference between Util.checkQueryOutputsAfterSortRecursive and Util.checkQueryOutputsAfterSort: we can send schema:LogicalSchema to Util.checkQueryOutputsAfterSortRecursive, so function will help change expectedResArray:String[] to expectedRes:ArrayList<Tuple> with proper schema(int,string or other type) 
>     
>                static public void checkQueryOutputsAfterSortRecursive(Iterator<Tuple> actualResultsIt,
>                 String[] expectedResArray, LogicalSchema schema) throws IOException {
>                 ...
>                 }

The output of group,join, distinct has difference between mr and spark mode, the output is in order in mr because sort is done in the shuffle process while there is no such operation in spark. So add following 
functions in org.apache.pig.test.Util to only sort the output when in mr mode

public static void checkQueryOutputs(Iterator<Tuple> actualResults, List<Tuple> expectedResults, boolean checkAfterSort) 

public static void checkQueryOutputs(Iterator<Tuple> actualResults, String[] expectedResults,LogicalSchema logicalSchema, boolean checkAfterSort) throws IOException


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestPigRunner.java, line 1155
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323890#file1323890line1155>
> >
> >     Is there a jira to fix the number of records?

In spark mode, if pig.disable.counter = true, the number of records of the input are not calculated.  I have add some comments to explain this.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestSecondarySort.java, line 520
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323897#file1323897line520>
> >
> >     Use Assume.assumeTrue

TestSecondarySort#testCustomPartitionerWithSort pass in spark mode now, need not skip this test


- kelly


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/45667/#review134255
-----------------------------------------------------------


On June 14, 2016, 4:39 a.m., Pallavi Rao wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/45667/
> -----------------------------------------------------------
> 
> (Updated June 14, 2016, 4:39 a.m.)
> 
> 
> Review request for pig, Daniel Dai and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-4059 and PIG-4854
>     https://issues.apache.org/jira/browse/PIG-4059
>     https://issues.apache.org/jira/browse/PIG-4854
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> The patch contains all the work done in the spark branch, so far.
> 
> 
> Diffs
> -----
> 
>   bin/pig 81f1426 
>   build.xml 99ba1f4 
>   ivy.xml dd9878e 
>   ivy/libraries.properties 3a819a5 
>   shims/test/hadoop20/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
>   shims/test/hadoop23/org/apache/pig/test/SparkMiniCluster.java PRE-CREATION 
>   shims/test/hadoop23/org/apache/pig/test/TezMiniCluster.java 792a1bd 
>   shims/test/hadoop23/org/apache/pig/test/YarnMiniCluster.java PRE-CREATION 
>   src/META-INF/services/org.apache.pig.ExecType 5c034c8 
>   src/docs/src/documentation/content/xdocs/start.xml 36f9952 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/PhysicalOperator.java 1ff1abd 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java ecf780c 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhysicalPlan.java 2376d03 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POCollectedGroup.java bcbfe2b 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POFRJoin.java d80951a 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java 21b75f1 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POGlobalRearrange.java 52cfb73 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POMergeJoin.java 13f70c0 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POSort.java c3a82c3 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/JobGraphBuilder.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/JobMetricsListener.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/KryoSerializer.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/MapReducePartitionerWrapper.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecType.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecutionEngine.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLocalExecType.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkUtil.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/UDFJarsFinder.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CounterConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/DistinctConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FRJoinConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ForEachConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/IndexedKey.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/IteratorTransform.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LimitConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LoadConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeCogroupConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeJoinConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/OutputConsumerIterator.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PackageConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/RDDConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/RankConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SkewedJoinConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SortConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SplitConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StoreConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/converter/UnionConverter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/operator/NativeSparkOperator.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POGlobalRearrangeSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/AccumulatorOptimizer.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/MultiQueryOptimizerSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/NoopFilterRemover.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/ParallelismSetter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/SecondaryKeyOptimizerSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/DotSparkPrinter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkCompiler.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkCompilerException.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkOpPlanVisitor.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkOperPlan.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkOperator.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkPOPackageAnnotator.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/plan/SparkPrinter.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/running/PigInputFormatSpark.java PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/util/AccumulatorOptimizerUtil.java c4b44ad 
>   src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java 889c01b 
>   src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java 0b59c9c 
>   src/org/apache/pig/backend/hadoop/hbase/HBaseStorage.java e0581d9 
>   src/org/apache/pig/data/SelfSpillBag.java d17f0a8 
>   src/org/apache/pig/impl/PigContext.java d43949f 
>   src/org/apache/pig/impl/plan/OperatorPlan.java 8b2e2e7 
>   src/org/apache/pig/tools/pigstats/PigStatsUtil.java 542cc2e 
>   src/org/apache/pig/tools/pigstats/spark/SparkCounter.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkCounterGroup.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkCounters.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkPigStats.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkPigStatusReporter.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkScriptState.java PRE-CREATION 
>   src/org/apache/pig/tools/pigstats/spark/SparkStatsUtil.java PRE-CREATION 
>   test/e2e/pig/build.xml f7c38ba 
>   test/e2e/pig/conf/spark.conf PRE-CREATION 
>   test/e2e/pig/drivers/TestDriverPig.pm bf9c302 
>   test/e2e/pig/tests/streaming.conf 18f2fb2 
>   test/excluded-tests-spark PRE-CREATION 
>   test/org/apache/pig/newplan/logical/relational/TestLocationInPhysicalPlan.java 94b34b3 
>   test/org/apache/pig/spark/TestIndexedKey.java PRE-CREATION 
>   test/org/apache/pig/spark/TestSecondarySortSpark.java PRE-CREATION 
>   test/org/apache/pig/test/MiniGenericCluster.java 9347269 
>   test/org/apache/pig/test/TestAssert.java 6d4b5c6 
>   test/org/apache/pig/test/TestBuiltin.java fbc3f1e 
>   test/org/apache/pig/test/TestCase.java c9bb2fa 
>   test/org/apache/pig/test/TestCollectedGroup.java a958d33 
>   test/org/apache/pig/test/TestCombiner.java df44293 
>   test/org/apache/pig/test/TestCubeOperator.java de96e6c 
>   test/org/apache/pig/test/TestEvalPipeline.java 9efde13 
>   test/org/apache/pig/test/TestEvalPipeline2.java c8f51d7 
>   test/org/apache/pig/test/TestEvalPipelineLocal.java c12d595 
>   test/org/apache/pig/test/TestFinish.java f18c103 
>   test/org/apache/pig/test/TestForEachNestedPlanLocal.java b0aa3a8 
>   test/org/apache/pig/test/TestGrunt.java 9eaf298 
>   test/org/apache/pig/test/TestHBaseStorage.java 8d2ad85 
>   test/org/apache/pig/test/TestLimitVariable.java 53b9dae 
>   test/org/apache/pig/test/TestMapSideCogroup.java 2c78b4a 
>   test/org/apache/pig/test/TestMergeJoin.java f1a9608 
>   test/org/apache/pig/test/TestMergeJoinOuter.java 81aee55 
>   test/org/apache/pig/test/TestMultiQuery.java c32eab7 
>   test/org/apache/pig/test/TestMultiQueryLocal.java b9ac035 
>   test/org/apache/pig/test/TestNativeMapReduce.java c4f6573 
>   test/org/apache/pig/test/TestNullConstant.java 3ea4509 
>   test/org/apache/pig/test/TestPigRunner.java fde8609 
>   test/org/apache/pig/test/TestPigServerLocal.java fbabd03 
>   test/org/apache/pig/test/TestProjectRange.java 2e3e7b8 
>   test/org/apache/pig/test/TestPruneColumn.java 3936332 
>   test/org/apache/pig/test/TestRank1.java 9e4ef62 
>   test/org/apache/pig/test/TestRank2.java fc802a9 
>   test/org/apache/pig/test/TestRank3.java 43af10d 
>   test/org/apache/pig/test/TestSecondarySort.java 8991010 
>   test/org/apache/pig/test/TestSkewedJoin.java dba2241 
>   test/org/apache/pig/test/TestStoreBase.java eb3b253 
>   test/org/apache/pig/test/Util.java 36d01e8 
>   test/spark-tests PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/45667/diff/
> 
> 
> Testing
> -------
> 
> New UTs were added where required and ensure old UTs pass -> https://builds.apache.org/job/Pig-spark/
> 
> 
> Thanks,
> 
> Pallavi Rao
> 
>