
Posted to issues@hive.apache.org by "Ádám Szita (Jira)" <ji...@apache.org> on 2020/07/29 13:20:00 UTC

[jira] [Work started] (HIVE-23947) Cache affinity is unset for text files read by LLAP

     [ https://issues.apache.org/jira/browse/HIVE-23947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-23947 started by Ádám Szita.
-----------------------------------------
> Cache affinity is unset for text files read by LLAP
> ---------------------------------------------------
>
>                 Key: HIVE-23947
>                 URL: https://issues.apache.org/jira/browse/HIVE-23947
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Ádám Szita
>            Assignee: Ádám Szita
>            Priority: Major
>
> LLAP relies on HostAffinitySplitLocationProvider to always route the same splits to the same LLAP daemons. With such a consistent distribution of data among the nodes, we get a good cache hit ratio and thus good performance.
> For text files this is almost never guaranteed: HostAffinitySplitLocationProvider is never used, because HS2 does not set the cache affinity flag in the job conf for text input format content during compilation. The launched Tez AM then has to rely on HDFS location information to route the splits (and therefore the tasks) to the executor nodes. This location information might not overlap well with where the daemons actually run, and in the S3 case the Tez AM will choose executors more or less at random.
> As a result, the cache hit ratio rarely reaches 100%: each time the same query is re-run, some disk or S3 reads still occur. This poor performance persists until the same content has been populated into all the daemons, which can take tens or hundreds of runs of the query.
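
To illustrate why affinity-based routing helps the hit ratio, here is a minimal sketch of the idea. It is not Hive's HostAffinitySplitLocationProvider: the class AffinitySketch, the method pickDaemon, the daemon names, and the example path are all hypothetical. CRC32 with a modulo is used as a simple stand-in for whatever hashing the real provider applies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical illustration only; this is not Hive's HostAffinitySplitLocationProvider.
final class AffinitySketch {

    // Deterministically map a split (file path + start offset) to one daemon.
    static String pickDaemon(String path, long startOffset, List<String> daemons) {
        if (daemons.isEmpty()) {
            throw new IllegalArgumentException("no daemons available");
        }
        CRC32 crc = new CRC32();
        crc.update((path + "#" + startOffset).getBytes(StandardCharsets.UTF_8));
        // CRC32.getValue() is non-negative, so the modulo is a valid list index.
        int index = (int) (crc.getValue() % daemons.size());
        return daemons.get(index);
    }

    public static void main(String[] args) {
        // Assumed daemon names and path, for illustration only.
        List<String> daemons = Arrays.asList("llap-0", "llap-1", "llap-2");
        String path = "s3a://example-bucket/warehouse/t/part-00000.txt";
        // Re-running the same query yields the same split, hence the same daemon,
        // so the data cached there on the first run can be reused.
        System.out.println(pickDaemon(path, 0L, daemons));
        System.out.println(pickDaemon(path, 0L, daemons));
    }
}
```

Because the choice depends only on the split and the daemon list, both calls print the same daemon. A plain modulo remaps most splits when the daemon list changes, so a real implementation would likely prefer a consistent-hashing scheme.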



--
This message was sent by Atlassian Jira
(v8.3.4#803005)