Posted to dev@hive.apache.org by Na Yang <ny...@maprtech.com> on 2014/11/07 03:35:33 UTC

Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/
-----------------------------------------------------------

Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.


Bugs: Hive-8756
    https://issues.apache.org/jira/browse/Hive-8756


Repository: hive-git


Description
-------

numRows and rawDataSize are not collected by the Spark stats because the FileSinkOperator in the ReduceWork does not have the stats config set. In GenSparkUtils.removeUnionOperators, the operator tree is cloned, and a new FileSinkOperator is generated and set on the reduce work. However, during processFileSink, GenMapRedUtils.addStatsTask sets the collectStats flag on the original FileSinkOperator, not on the new FileSinkOperator that the ReduceWork actually uses.
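
To make the ordering problem concrete, here is a minimal, self-contained sketch. The Sink and SinkConf classes are hypothetical stand-ins, not Hive's FileSinkOperator or its descriptor; the sketch only replays the order of events described above (clone first, set the flag afterwards) and assumes the clone is a deep copy with its own configuration object.

```java
// Illustrative stand-ins only; these are NOT Hive's real classes.
class CloneBeforeFlagSketch {

    /** Hypothetical stand-in for a file sink's configuration. */
    static final class SinkConf {
        boolean gatherStats;

        SinkConf copy() {
            SinkConf c = new SinkConf();
            c.gatherStats = this.gatherStats;
            return c;
        }
    }

    /** Hypothetical stand-in for a file sink operator. */
    static final class Sink {
        final SinkConf conf;

        Sink(SinkConf conf) {
            this.conf = conf;
        }

        /** Deep copy: the clone gets its own configuration object. */
        Sink deepCopy() {
            return new Sink(conf.copy());
        }
    }

    /** Replays the order of events; returns {flag on original, flag on clone}. */
    static boolean[] replay() {
        Sink original = new Sink(new SinkConf());
        // Step 1: the operator tree is cloned; the clone is what the work runs.
        Sink clone = original.deepCopy();
        // Step 2: the stats flag is set later, and only on the original.
        original.conf.gatherStats = true;
        return new boolean[] {original.conf.gatherStats, clone.conf.gatherStats};
    }

    public static void main(String[] args) {
        boolean[] flags = replay();
        System.out.println("original gatherStats = " + flags[0]);
        System.out.println("clone gatherStats = " + flags[1]);
    }
}
```

In this model the original ends up with the flag set and the clone does not, which matches the symptom: the operator that actually runs never gathers stats.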


Diffs
-----

  itests/src/test/resources/testconfiguration.properties 79a0132 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
  ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/27719/diff/


Testing
-------


Thanks,

Na Yang


Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Na Yang <ny...@maprtech.com>.

> On Nov. 7, 2014, 9:42 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/27719/diff/2/?file=754739#file754739line228>
> >
> >     One thing I'm not clear on is why cloning the operator tree doesn't also clone the missing stats flags. Judging from Utilities.cloneOperatorTree(), it seems it should.
> 
> Na Yang wrote:
>     Hi Xuefu, the stats flag is set after cloneOperatorTree runs. It is set in the processFileSink step, in GenMapRedUtils.isMergeRequired. Because we did not put the cloned file sinks into the fileSinkSet, the stats flags are set only on the original FileSinkOperator. In the GenMapRedUtils.processFileSink API, I added the following to read the stats flags from the original FileSinkOperator and set them on the cloned FileSinkOperators.
>     
>         // Set stats config for FileSinkOperators which are cloned from the fileSink
>         List<FileSinkOperator> fileSinkList = context.fileSinkMap.get(fileSink);
>         if (fileSinkList != null) {
>           for (FileSinkOperator fsOp : fileSinkList) {
>             fsOp.getConf().setGatherStats(fileSink.getConf().isGatherStats());
>             fsOp.getConf().setStatsReliable(fileSink.getConf().isStatsReliable());
>             fsOp.getConf().setMaxStatsKeyPrefixLength(fileSink.getConf().getMaxStatsKeyPrefixLength());
>           }
>         }
> 
> Xuefu Zhang wrote:
>     Thanks for the explanation. If we do put the cloned FileSinkOperators in fileSinkSet, is it true that we don't have to manually copy the flags over?

Yes, if we put the cloned FileSinkOperators in fileSinkSet, we do not have to manually copy the flags over. However, this would make the plan more complicated, because each file sink operator in the fileSinkSet may generate a Merge and Move task. To avoid duplicated Merge and Move tasks, we put only the original FileSinkOperator in the fileSinkSet. I remember seeing some issues (wrong data results) earlier when the cloned file sink operators were in the fileSinkSet (this issue actually happens in Tez as well), so we removed the duplicated file sink operators from the fileSinkSet.
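
The trade-off can be sketched with a toy model. Everything here is hypothetical: Sink is a stand-in for FileSinkOperator, and process() simplifies the real post-processing by assuming that every registered sink gets the stats flag and exactly one Merge and Move task (in Hive this only "may" happen). Option A registers every clone; option B registers only the original and then copies its flag to the clones.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model only; these are NOT Hive's real classes or planner logic.
class FileSinkSetTradeoffSketch {

    /** Hypothetical stand-in for a file sink operator. */
    static final class Sink {
        final String name;
        boolean gatherStats;

        Sink(String name) {
            this.name = name;
        }
    }

    /** Simplified post-processing: flag each registered sink, count one task per sink. */
    static int process(Set<Sink> fileSinkSet) {
        int mergeMoveTasks = 0;
        for (Sink sink : fileSinkSet) {
            sink.gatherStats = true;
            mergeMoveTasks++;
        }
        return mergeMoveTasks;
    }

    /** Returns {tasks in option A, tasks in option B, flagged clones in option B}. */
    static int[] demo() {
        // Option A: register every clone; flags are set, but tasks are duplicated.
        List<Sink> clonesA = Arrays.asList(new Sink("clone-1"), new Sink("clone-2"));
        int tasksA = process(new LinkedHashSet<>(clonesA));

        // Option B: register only the original, then copy its flag to the clones.
        Sink original = new Sink("original");
        List<Sink> clonesB = Arrays.asList(new Sink("clone-1"), new Sink("clone-2"));
        Set<Sink> onlyOriginal = new LinkedHashSet<>();
        onlyOriginal.add(original);
        int tasksB = process(onlyOriginal);
        int flaggedClones = 0;
        for (Sink clone : clonesB) {
            clone.gatherStats = original.gatherStats;
            if (clone.gatherStats) {
                flaggedClones++;
            }
        }
        return new int[] {tasksA, tasksB, flaggedClones};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("option A tasks = " + r[0]);
        System.out.println("option B tasks = " + r[1] + ", flagged clones = " + r[2]);
    }
}
```

Under these assumptions, option B keeps a single task while still flagging every clone, which is the reason for copying the flags manually.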


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60385
-----------------------------------------------------------




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.

> On Nov. 7, 2014, 9:42 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/27719/diff/2/?file=754739#file754739line228>
> >
> >     One thing I'm not clear on is why cloning the operator tree doesn't also clone the missing stats flags. Judging from Utilities.cloneOperatorTree(), it seems it should.
> 
> Na Yang wrote:
>     Hi Xuefu, the stats flag is set after cloneOperatorTree runs. It is set in the processFileSink step, in GenMapRedUtils.isMergeRequired. Because we did not put the cloned file sinks into the fileSinkSet, the stats flags are set only on the original FileSinkOperator. In the GenMapRedUtils.processFileSink API, I added the following to read the stats flags from the original FileSinkOperator and set them on the cloned FileSinkOperators.
>     
>         // Set stats config for FileSinkOperators which are cloned from the fileSink
>         List<FileSinkOperator> fileSinkList = context.fileSinkMap.get(fileSink);
>         if (fileSinkList != null) {
>           for (FileSinkOperator fsOp : fileSinkList) {
>             fsOp.getConf().setGatherStats(fileSink.getConf().isGatherStats());
>             fsOp.getConf().setStatsReliable(fileSink.getConf().isStatsReliable());
>             fsOp.getConf().setMaxStatsKeyPrefixLength(fileSink.getConf().getMaxStatsKeyPrefixLength());
>           }
>         }

Thanks for the explanation. If we do put the cloned FileSinkOperators in fileSinkSet, is it true that we don't have to manually copy the flags over?


- Xuefu


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60385
-----------------------------------------------------------




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Na Yang <ny...@maprtech.com>.

> On Nov. 7, 2014, 9:42 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 228
> > <https://reviews.apache.org/r/27719/diff/2/?file=754739#file754739line228>
> >
> >     One thing I'm not clear on is why cloning the operator tree doesn't also clone the missing stats flags. Judging from Utilities.cloneOperatorTree(), it seems it should.

Hi Xuefu, the stats flag is set after cloneOperatorTree runs. It is set in the processFileSink step, in GenMapRedUtils.isMergeRequired. Because we did not put the cloned file sinks into the fileSinkSet, the stats flags are set only on the original FileSinkOperator. In the GenMapRedUtils.processFileSink API, I added the following to read the stats flags from the original FileSinkOperator and set them on the cloned FileSinkOperators.

    // Set stats config for FileSinkOperators which are cloned from the fileSink
    List<FileSinkOperator> fileSinkList = context.fileSinkMap.get(fileSink);
    if (fileSinkList != null) {
      for (FileSinkOperator fsOp : fileSinkList) {
        fsOp.getConf().setGatherStats(fileSink.getConf().isGatherStats());
        fsOp.getConf().setStatsReliable(fileSink.getConf().isStatsReliable());
        fsOp.getConf().setMaxStatsKeyPrefixLength(fileSink.getConf().getMaxStatsKeyPrefixLength());
      }
    }


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60385
-----------------------------------------------------------




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60385
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
<https://reviews.apache.org/r/27719/#comment101749>

    One thing I'm not clear on is why cloning the operator tree doesn't also clone the missing stats flags. Judging from Utilities.cloneOperatorTree(), it seems it should.


- Xuefu Zhang




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review62557
-----------------------------------------------------------

Ship it!



- Xuefu Zhang




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Na Yang <ny...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/
-----------------------------------------------------------

(Updated Nov. 7, 2014, 9:16 p.m.)


Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang.


Changes
-------

1. removed whitespace characters
2. handle operators which have multiple children
3. update stats config info for all cloned FileSinkOperators


Bugs: Hive-8756
    https://issues.apache.org/jira/browse/Hive-8756


Repository: hive-git


Description
-------

numRows and rawDataSize are not collected by the Spark stats because the FileSinkOperator in the ReduceWork does not have the stats config set. In GenSparkUtils.removeUnionOperators, the operator tree is cloned, and a new FileSinkOperator is generated and set on the reduce work. However, during processFileSink, GenMapRedUtils.addStatsTask sets the collectStats flag on the original FileSinkOperator, not on the new FileSinkOperator that the ReduceWork actually uses.


Diffs (updated)
-----

  itests/src/test/resources/testconfiguration.properties 79a0132 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 8290568 
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java e8e18a7 
  ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out 8d237c5 
  ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 4946815 
  ql/src/test/results/clientpositive/spark/semijoin.q.out 9b6802d 
  ql/src/test/results/clientpositive/spark/stats1.q.out PRE-CREATION 

Diff: https://reviews.apache.org/r/27719/diff/


Testing
-------


Thanks,

Na Yang


Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Na Yang <ny...@maprtech.com>.

> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java, line 220
> > <https://reviews.apache.org/r/27719/diff/1/?file=754282#file754282line220>
> >
> >     Could you please remove the trailing spaces?

Sure.


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > ql/src/test/results/clientpositive/spark/stats1.q.out, line 182
> > <https://reviews.apache.org/r/27719/diff/1/?file=754283#file754283line182>
> >
> >     This seems slightly different from MR's output. I'm wondering if this is expected.

Xuefu, thank you for doing the code review. The Spark output is missing the stats data for one FileSinkOperator. I need to fix that.


> On Nov. 7, 2014, 3:32 a.m., Xuefu Zhang wrote:
> > The original code is pretty much cloned from Tez; I'm wondering if Tez suffers from the same problem.

We modified the remove-union code in Spark by removing the newly cloned FileSinkOperators from the fileSinkSet, to avoid generating multiple duplicated merge tasks. However, this left the stats flag missing from the cloned FileSinkOperators, which are the ones actually used in the SparkWork. My current patch adds the stats flag to only one of the cloned FileSinkOperators, not all of them, which causes the wrong output. I will reconsider the fix and update the patch accordingly. Thank you, Xuefu, for the code review!


- Na


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60294
-----------------------------------------------------------




Re: Review Request 27719: numRows and rawDataSize are not collected by the Spark stats [Spark Branch]

Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27719/#review60294
-----------------------------------------------------------



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
<https://reviews.apache.org/r/27719/#comment101661>

    Could you please remove the trailing spaces?



ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java
<https://reviews.apache.org/r/27719/#comment101663>

    The iteration seems to assume that each operator has only one child. I'm wondering whether this assumption always holds.
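
    For illustration, a minimal sketch of the concern. Op is a hypothetical stand-in for an operator node, and neither method is the code under review; the first walks a single path by always taking the first child, the second visits every child with an explicit stack. The sketch assumes a plain tree; a graph where operators share children would also need a visited set.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative stand-ins only; these are NOT Hive's real classes.
class MultiChildWalkSketch {

    /** Hypothetical operator node with any number of children. */
    static final class Op {
        final String name;
        final boolean isFileSink;
        final List<Op> children = new ArrayList<>();

        Op(String name, boolean isFileSink) {
            this.name = name;
            this.isFileSink = isFileSink;
        }

        Op add(Op child) {
            children.add(child);
            return this;
        }
    }

    /** Follows only the first child at each step, so sibling branches are skipped. */
    static List<String> sinksFirstChildOnly(Op root) {
        List<String> found = new ArrayList<>();
        Op current = root;
        while (current != null) {
            if (current.isFileSink) {
                found.add(current.name);
            }
            current = current.children.isEmpty() ? null : current.children.get(0);
        }
        return found;
    }

    /** Visits every child of every operator using an explicit stack. */
    static List<String> sinksAllChildren(Op root) {
        List<String> found = new ArrayList<>();
        Deque<Op> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Op current = stack.pop();
            if (current.isFileSink) {
                found.add(current.name);
            }
            for (Op child : current.children) {
                stack.push(child);
            }
        }
        return found;
    }

    /** TS -> SEL -> {FS_1, FS_2}: one operator with two file sink children. */
    static Op sampleTree() {
        Op select = new Op("SEL", false).add(new Op("FS_1", true)).add(new Op("FS_2", true));
        return new Op("TS", false).add(select);
    }

    public static void main(String[] args) {
        System.out.println("first child only: " + sinksFirstChildOnly(sampleTree()));
        System.out.println("all children: " + sinksAllChildren(sampleTree()));
    }
}
```

    On the sample tree the first-child walk finds one file sink and the full walk finds both.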



ql/src/test/results/clientpositive/spark/stats1.q.out
<https://reviews.apache.org/r/27719/#comment101664>

    This seems slightly different from MR's output. I'm wondering if this is expected.


The original code is pretty much cloned from Tez; I'm wondering if Tez suffers from the same problem.

- Xuefu Zhang

