Posted to user@hive.apache.org by Gopal Vijayaraghavan <go...@apache.org> on 2018/02/03 02:42:48 UTC

Re: Question on accessing LLAP as data cache from external containers

> For example, a Hive job may start Tez containers, which then retrieve data from LLAP running concurrently. In the current implementation, this is unrealistic

That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, instead of an all-or-nothing implementation.

Here are the slides from Hadoop Summit 2015 describing how that is plugged into LLAP.

https://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive/21

The flag in question is hive.llap.execution.mode - the most common use-case imagined for it was mode=map, where only the table-scan + all secure operators (i.e. no temporary UDFs) run inside LLAP (to take advantage of the cache).
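As a rough sketch of flipping that mode per-session (this assumes the PyHive client and HiveServer2 on localhost:10000; the table and query are made up - any HS2 client works the same way):

    # a minimal sketch, assuming PyHive and HiveServer2 on localhost:10000
    from pyhive import hive

    conn = hive.connect(host="localhost", port=10000)
    cur = conn.cursor()

    # run only the scan side of the plan inside LLAP, the rest in Tez containers
    cur.execute("SET hive.llap.execution.mode=map")

    # hypothetical query - the scan of web_logs is what gets pushed to the daemons
    cur.execute("SELECT ip, count(*) FROM web_logs GROUP BY ip")
    print(cur.fetchall())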

LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez container back into the daemon (& that's not very useful, since it won't be cached).

Here's the class that decides the hybrid execution tree & plans the split between LLAP and Tez in the same query DAG.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/LlapDecider.java#L81

If you want to consume the LLAP cached rows from something like GPUs running Caffe, you can access the LLAP cache via the SparkSQL data-source APIs.

https://github.com/hortonworks/spark-llap-release/blob/HDP-2.6.3.0-235-tag/examples/src/main/python/spark_llap_dsl.py

This is faster than directly reading off Cloud filesystems (because of LLAP's SSD cache). Even with a perf penalty on-prem, it is very useful to restrict the access of Spark ML[1] to certain columns (i.e. you can extract lat/long from a table which has other PII data) without having to make a complete copy of the projected data to share from the EDW end of the shop to the ML side of it - and that works even if the entire data-set is HDFS encrypted.
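For instance, here's a minimal PySpark sketch of that read + projection pattern; the datasource name, table, and column names are assumptions modeled on the linked spark_llap_dsl.py, not a definitive API reference:

    # a rough sketch modeled on the linked spark-llap example; the datasource
    # name ("org.apache.spark.sql.hive.llap"), table, and columns are assumptions
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("llap-reader").getOrCreate()

    # read the table through the LLAP daemons instead of the filesystem,
    # so the column/row policies apply and the SSD cache is used
    df = spark.read.format("org.apache.spark.sql.hive.llap") \
        .option("table", "geo.events") \
        .load()

    # only the projected columns leave LLAP - lat/long come through, while
    # the PII columns of the same table never reach the Spark side
    df.select("lat", "long").show()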

Cheers,
Gopal
[1] - https://hortonworks.com/blog/row-column-level-control-apache-spark/