You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by chengxiang li <ch...@intel.com> on 2015/05/20 04:37:08 UTC
Review Request 34455: HIVE-10550 Dynamic RDD caching optimization for
HoS.[Spark Branch]
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/
-----------------------------------------------------------
Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550
Repository: hive-git
Description
-------
see jira description
Diffs
-----
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17
ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e
spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java beed8a3
spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java e1e899e
spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java b77c9e8
spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java d33ad7e
Diff: https://reviews.apache.org/r/34455/diff/
Testing
-------
Thanks,
chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
> On 五月 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java, line 41
> > <https://reviews.apache.org/r/34455/diff/1/?file=964754#file964754line41>
> >
> > Currently the storage level is memory+disk. Any reason to change it to memory_only?
Cache data to disk means that data need serialization and deserialization, it's costly, and sometime may overwhlem the gain of cache, and it's hard to measure programatically, as read from source file just do deserialization, cache in disk need an additional serialization
Instead of add an optimizer which may or may not promote performance for user, i think it may be better to narrow the the optimzir scope a little bit, to make sure this optimizer do promote the performance.
> On 五月 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java, line 63
> > <https://reviews.apache.org/r/34455/diff/1/?file=964756#file964756line63>
> >
> > Can we keep the old code around. I understand it's not currently used.
Of course we can, it just make the code a little mess, you knon, for others who want to read the cache related code.
> On 五月 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java, line 25
> > <https://reviews.apache.org/r/34455/diff/1/?file=964757#file964757line25>
> >
> > I cannot construct a case where a MapTran would need caching. Do you have an example?
For any queries which contains SparkWork like this: MapWork --> ReduceWork
\ --> ReduceWork
for example, from person_orc insert overwrite table p1 select city, count(*) as s group by city order by s insert overwrite table p2 select city, avg(age) as g group by city order by g;
> On 五月 20, 2015, 9:12 p.m., Xuefu Zhang wrote:
> > spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java, line 419
> > <https://reviews.apache.org/r/34455/diff/1/?file=964774#file964774line419>
> >
> > Do you think it makes sense for us to release the cache as soon as the job is completed, as it's done here?
Theoretically we does not need to, i mean it would not lead to any extra memory leak issue, the only benefit of unpersist cache manually i can image is that it reduce GC effort, as Hive do it programatically instead of let GC collect it.
The reason i remove it is that, it add extra complexility to code, and not expandable for share cached RDD cross Spark job.
- chengxiang
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review84572
-----------------------------------------------------------
On 五月 20, 2015, 2:37 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated 五月 20, 2015, 2:37 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
> spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e
> spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java beed8a3
> spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java e1e899e
> spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java b77c9e8
> spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java d33ad7e
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review84572
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java
<https://reviews.apache.org/r/34455/#comment135909>
Currently the storage level is memory+disk. Any reason to change it to memory_only?
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java
<https://reviews.apache.org/r/34455/#comment135908>
Can we keep the old code around. I understand it's not currently used.
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java
<https://reviews.apache.org/r/34455/#comment135911>
I cannot construct a case where a MapTran would need caching. Do you have an example?
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java
<https://reviews.apache.org/r/34455/#comment135910>
Can we keep the comment around?
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java
<https://reviews.apache.org/r/34455/#comment135904>
We need to consider if it makes to unpersist rdds explicitly.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
<https://reviews.apache.org/r/34455/#comment135896>
Nit: variable name.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
<https://reviews.apache.org/r/34455/#comment135897>
Nit: class name. Maybe CommonParentWorkMatcher?
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
<https://reviews.apache.org/r/34455/#comment135903>
This nested loop doesn't seem efficient. SparkWork is basically a graph. Finding a node that has multiple children should be fairly easy using graph traversing. An example would be in SplitSparkWorkResolver.
Getting the value for the threshold should be outside the look, nevertheless.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java
<https://reviews.apache.org/r/34455/#comment135900>
We need to consider the case where the statistics may not be present.
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java
<https://reviews.apache.org/r/34455/#comment135894>
I think we need a better name for this variable, something like cachedWorks or worksToCache.
spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java
<https://reviews.apache.org/r/34455/#comment135893>
Do you think it makes sense for us to release the cache as soon as the job is completed, as it's done here?
- Xuefu Zhang
On May 20, 2015, 2:37 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 20, 2015, 2:37 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/LocalHiveSparkClient.java 19d3fee
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 26cfebd
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/RemoteHiveSparkClient.java 8b15099
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java a774395
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/impl/LocalSparkJobStatus.java 5d62596
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 8e56263
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSkewJoinProcFactory.java 5990d17
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SplitSparkWorkResolver.java fb20080
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
> spark-client/src/main/java/org/apache/hive/spark/client/JobContext.java af6332e
> spark-client/src/main/java/org/apache/hive/spark/client/JobContextImpl.java beed8a3
> spark-client/src/main/java/org/apache/hive/spark/client/MonitorCallback.java e1e899e
> spark-client/src/main/java/org/apache/hive/spark/client/RemoteDriver.java b77c9e8
> spark-client/src/test/java/org/apache/hive/spark/client/TestSparkClient.java d33ad7e
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Alexander Pivovarov <ap...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review84876
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
<https://reviews.apache.org/r/34455/#comment136299>
use 2 spaces for indent
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java
<https://reviews.apache.org/r/34455/#comment136300>
use 2 spaces for indent
- Alexander Pivovarov
On May 22, 2015, 6:18 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 22, 2015, 6:18 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
> On 五月 27, 2015, 10:13 p.m., Xuefu Zhang wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062
> > <https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062>
> >
> > Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK?
> >
> > We can come back to revisit this later on. At this moment, I don't feel confident to make the call.
>
> chengxiang li wrote:
> persistent to MEM + DISK may hurt the performance in certain cases, i think at least we should have a switch to open/close this optimization,
>
> Xuefu Zhang wrote:
> Agreed. However, before we find out more about in what cases this helps or hurts, I think it's better we keep the existing behavior. This doesn't prevent us from adding a flag later on.
Ok, i would remove these configurations from patch in temp, we can discuss later when we got more knowledge about it.
- chengxiang
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85451
-----------------------------------------------------------
On 五月 27, 2015, 1:50 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated 五月 27, 2015, 1:50 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
> On May 27, 2015, 10:13 p.m., Xuefu Zhang wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062
> > <https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062>
> >
> > Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK?
> >
> > We can come back to revisit this later on. At this moment, I don't feel confident to make the call.
>
> chengxiang li wrote:
> persistent to MEM + DISK may hurt the performance in certain cases, i think at least we should have a switch to open/close this optimization,
>
> Xuefu Zhang wrote:
> Agreed. However, before we find out more about in what cases this helps or hurts, I think it's better we keep the existing behavior. This doesn't prevent us from adding a flag later on.
>
> chengxiang li wrote:
> Ok, i would remove these configurations from patch in temp, we can discuss later when we got more knowledge about it.
Please feel free to create a followup JIRA to do more research. We can try different data sizes and persistancy levels to see the result. At that time, we can decide if it makes sense to introduce configurations. Thanks.
- Xuefu
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85451
-----------------------------------------------------------
On May 28, 2015, 3:30 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 28, 2015, 3:30 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
> On 五月 27, 2015, 10:13 p.m., Xuefu Zhang wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062
> > <https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062>
> >
> > Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK?
> >
> > We can come back to revisit this later on. At this moment, I don't feel confident to make the call.
persistent to MEM + DISK may hurt the performance in certain cases, i think at least we should have a switch to open/close this optimization,
- chengxiang
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85451
-----------------------------------------------------------
On 五月 27, 2015, 1:50 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated 五月 27, 2015, 1:50 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
> On May 27, 2015, 10:13 p.m., Xuefu Zhang wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 2062
> > <https://reviews.apache.org/r/34455/diff/3/?file=972428#file972428line2062>
> >
> > Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK?
> >
> > We can come back to revisit this later on. At this moment, I don't feel confident to make the call.
>
> chengxiang li wrote:
> persistent to MEM + DISK may hurt the performance in certain cases, i think at least we should have a switch to open/close this optimization,
Agreed. However, before we find out more about in what cases this helps or hurts, I think it's better we keep the existing behavior. This doesn't prevent us from adding a flag later on.
- Xuefu
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85451
-----------------------------------------------------------
On May 27, 2015, 1:50 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 27, 2015, 1:50 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85451
-----------------------------------------------------------
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/34455/#comment137023>
Sorry for pointing this out late. I'm not certain if it's a good idea to expose these two configurations. Also this introduces a change of behavior. For now, can we get rid of them and change the persistency level back to MEM+DISK?
We can come back to revisit this later on. At this moment, I don't feel confident to make the call.
- Xuefu Zhang
On May 27, 2015, 1:50 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 27, 2015, 1:50 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85509
-----------------------------------------------------------
Ship it!
Ship It!
- Xuefu Zhang
On May 28, 2015, 3:30 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 28, 2015, 3:30 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/
-----------------------------------------------------------
(Updated May 28, 2015, 3:30 a.m.)
Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
Changes
-------
remove configs, and move common parent match logic in SparkPlanGenerator directly.
Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550
Repository: hive-git
Description
-------
see jira description
Diffs (updated)
-----
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
Diff: https://reviews.apache.org/r/34455/diff/
Testing
-------
Thanks,
chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/
-----------------------------------------------------------
(Updated May 27, 2015, 1:50 a.m.)
Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
Changes
-------
fix what listed in comments.
Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550
Repository: hive-git
Description
-------
see jira description
Diffs (updated)
-----
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
Diff: https://reviews.apache.org/r/34455/diff/
Testing
-------
Thanks,
chengxiang li
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by Xuefu Zhang <xz...@cloudera.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/#review85107
-----------------------------------------------------------
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java
<https://reviews.apache.org/r/34455/#comment136586>
Could you change this to DEBUG level and "if (LOG.isDebugEnabled())"?
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java
<https://reviews.apache.org/r/34455/#comment136589>
worksToCache or WorksToClone?
- Xuefu Zhang
On May 22, 2015, 6:18 a.m., chengxiang li wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34455/
> -----------------------------------------------------------
>
> (Updated May 22, 2015, 6:18 a.m.)
>
>
> Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
>
>
> Bugs: HIVE-10550
> https://issues.apache.org/jira/browse/HIVE-10550
>
>
> Repository: hive-git
>
>
> Description
> -------
>
> see jira description
>
>
> Diffs
> -----
>
> common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
> ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
> ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
> ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
>
> Diff: https://reviews.apache.org/r/34455/diff/
>
>
> Testing
> -------
>
>
> Thanks,
>
> chengxiang li
>
>
Re: Review Request 34455: HIVE-10550 Dynamic RDD caching optimization
for HoS.[Spark Branch]
Posted by chengxiang li <ch...@intel.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34455/
-----------------------------------------------------------
(Updated May 22, 2015, 6:18 a.m.)
Review request for hive, Chao Sun, Jimmy Xiang, and Xuefu Zhang.
Changes
-------
Keep all the previous multi-insert cache code.
Bugs: HIVE-10550
https://issues.apache.org/jira/browse/HIVE-10550
Repository: hive-git
Description
-------
see jira description
Diffs (updated)
-----
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 43c53fc
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/CacheTran.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java 2170243
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java e60dfac
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlan.java ee5c78a
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 3f240f5
ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java e6c845c
ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkRddCachingResolver.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 19aae70
ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java bb5dd79
Diff: https://reviews.apache.org/r/34455/diff/
Testing
-------
Thanks,
chengxiang li