You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by Jesús Camacho Rodríguez <jc...@hortonworks.com> on 2018/02/01 20:36:26 UTC

Re: Review Request 65304: HIVE-18513 Query results caching


> On Jan. 29, 2018, 6:26 p.m., Jesús Camacho Rodríguez wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
> > Lines 3699 (patched)
> > <https://reviews.apache.org/r/65304/diff/3/?file=1946552#file1946552line3699>
> >
> >     Is this the permission for each results directory? Does this mean that results cannot be shared by different users?
> >     
> >     Why does this need to be configurable? (I would assume this is not something that you let the user decide).
> 
> Jason Dere wrote:
>     If this is HiveServer2 (with doAs=false), the hive user would be the directory owner regardless of which user is submitting the query, so I would not expect issues with sharing the cache in the HiveServer2 case. One danger with making this directory readable by others is the possibility that it allows a user to access cached results which the user may not have permission to see, if user-level filtering/masking rules are enabled. Different instances of Hive do not share caches, so sharing results between different Hive CLI instances is not something I am worrying about here.
>     
>     I was basing the directory creation rules on the scratchdir logic, which does allow this to be configurable. But really the setting at the time of cache initialization is what matters. If you think this should just be hardcoded to 700 let me know and I can make the change.

We can remove the configuration flag indeed, I do not think it should be configurable from Hive via conf.


> On Jan. 29, 2018, 6:26 p.m., Jesús Camacho Rodríguez wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/Driver.java
> > Lines 1829 (patched)
> > <https://reviews.apache.org/r/65304/diff/3/?file=1946566#file1946566line1829>
> >
> >     Currently what happens when we get this exception?
> >     
> >     It seems to me that mechanism in HIVE-17626 works when execution itself fails, but in this case, whether the cache entry is still valid or not can be inferred statically at planning time, hence I am not sure whether it should be handled the same way? It seems we will have certain overhead that might not be necessary.
> >     
> >     Can't we check the validity of the entry when we are replacing the plan by the scan on the cached results, e.g., in SemanticAnalyzer?
> 
> Jason Dere wrote:
>     So the table locks for the query are not acquired until query execution, which occurs after query compilation. If we implement automatic invalidation of the cache based on updates to the Hive tables, the following could happen:
>     
>     1. Query A goes through query compilation and finds an entry in the cache that can be used to satisfy the query. At this point there have been no updates to the table which would invalidate the cache.
>     2. Query B acquires a write lock and begins making updates to one or more of the tables involved in query A
>     3. Query A attempts to acquire read locks and blocks while query B is running.
>     4. Query B finishes updating the tables and releases its lock.
>     5. Query A now acquires the read lock, but at this point the cached result is stale.
>     
>     The options here would be to either attempt to recompile the query (the current approach), or to just go ahead and serve the stale results.

Should query A get a read lock on the original source tables or only on the cached results? AFAIK write lock is only acquired when we make the move of the results to the final directory. Hence, we do not actually get any proper "transactional" guarantees with these locks. Hive is also moving towards making ACID/MM the default (I think this has already happened in master) which do not rely on this type of locking. Hence, I think we can remove this.


> On Jan. 29, 2018, 6:26 p.m., Jesús Camacho Rodríguez wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/cache/results/QueryResultsCache.java
> > Lines 174 (patched)
> > <https://reviews.apache.org/r/65304/diff/3/?file=1946568#file1946568line174>
> >
> >     What happens if execution fails? Will the results still be cleaned properly?
> 
> Jason Dere wrote:
>     This is being called in Driver.closeInProcess()/Driver.close(), so my impression was this should be the case.

Makes sense. What if HS2 fails or closes? Should we add it to the shutdown hook so it cleans it even if readers number is greater than zero? (Not sure about the order in which things happen in this case, just want to make sure we do not leave any residual data/dirs).


- Jesús


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65304/#review196440
-----------------------------------------------------------


On Jan. 31, 2018, 6:16 a.m., Jason Dere wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65304/
> -----------------------------------------------------------
> 
> (Updated Jan. 31, 2018, 6:16 a.m.)
> 
> 
> Review request for hive, Ashutosh Chauhan, Gopal V, and Jesús Camacho Rodríguez.
> 
> 
> Bugs: HIVE-18513
>     https://issues.apache.org/jira/browse/HIVE-18513
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> - For queries that result in MR/Tez/Spark jobs on the cluster, save the temporary query results to a cache directory where they can be re-used.
> - Add QueryResultsCache to manage cached results. Currently cache invalidation is time-based, update-based cache invalidation needs to be added later.
> - Driver/SemanticAnalyzer/Calcite planner changes to lookup queries in the cache and use in place of the query plan.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java b7d3e99 
>   common/src/java/org/apache/hadoop/hive/ql/log/PerfLogger.java 2767bca 
>   data/conf/hive-site.xml 01f83d1 
>   data/conf/llap/hive-site.xml cdda875 
>   data/conf/perf-reg/spark/hive-site.xml 497a61f 
>   data/conf/perf-reg/tez/hive-site.xml 012369f 
>   data/conf/rlist/hive-site.xml 9de00e5 
>   data/conf/spark/local/hive-site.xml fd0e6a0 
>   data/conf/spark/standalone/hive-site.xml 1e5bd65 
>   data/conf/spark/yarn-client/hive-site.xml a9a788b 
>   data/conf/tez/hive-site.xml 4519678 
>   itests/hive-blobstore/src/test/resources/hive-site.xml 038db0d 
>   itests/src/test/resources/testconfiguration.properties d86ff58 
>   itests/util/src/main/java/org/apache/hadoop/hive/ql/QTestUtil.java 4432aca 
>   ql/src/java/org/apache/hadoop/hive/ql/Driver.java 74595b0 
>   ql/src/java/org/apache/hadoop/hive/ql/cache/results/CacheUsage.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/cache/results/QueryResultsCache.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java d7b3e4b 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveCalciteUtil.java f0dd167 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/HiveRelOpMaterializationValidator.java PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 372cfad 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/CalcitePlanner.java 85a1f34 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java ae2ec3d 
>   ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 83dfb47 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/FetchWork.java 7243dc7 
>   ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java d6083e6 
>   ql/src/test/queries/clientpositive/results_cache_1.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/results_cache_2.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/results_cache_capacity.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/results_cache_lifetime.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/results_cache_temptable.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/results_cache_with_masking.q PRE-CREATION 
>   ql/src/test/results/clientpositive/llap/results_cache_1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_1.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_2.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_capacity.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_lifetime.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_temptable.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/results_cache_with_masking.q.out PRE-CREATION 
>   service/src/java/org/apache/hive/service/server/HiveServer2.java 2a528cd 
> 
> 
> Diff: https://reviews.apache.org/r/65304/diff/4/
> 
> 
> Testing
> -------
> 
> qfile tests added.
> 
> 
> Thanks,
> 
> Jason Dere
> 
>