Posted to issues@impala.apache.org by "Sahil Takiar (Jira)" <ji...@apache.org> on 2020/01/24 20:55:00 UTC

[jira] [Created] (IMPALA-9327) Data cache should be able to borrow spill-to-disk space

Sahil Takiar created IMPALA-9327:
------------------------------------

             Summary: Data cache should be able to borrow spill-to-disk space
                 Key: IMPALA-9327
                 URL: https://issues.apache.org/jira/browse/IMPALA-9327
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
            Reporter: Sahil Takiar


Currently, users typically allocate a fixed amount of space for the data cache and spill-to-disk using the configuration options {{--data_cache}} and {{--scratch_dirs}}. For example, {{data_cache=/impala/cache:200GB}} and {{scratch_dirs=/impala/scratch:200GB}}.

The problem with this kind of static configuration is that if no queries spill to disk, the 200GB reserved for scratch space goes unused. It would improve Impala performance and resource utilization if the data cache were able to steal disk space from the spill-to-disk manager. The space could be returned to the spill-to-disk manager when required (e.g. by evicting data from the cache).

Users don't have to put a limit on the data_cache / scratch_dirs size (e.g. if the 200GB were omitted, Impala would just write files until there is no more disk space left). The problem with leaving the limits off is that there is no fairness policy between the data cache and scratch space. What will likely happen is that the data cache will consume all the disk capacity, leaving none for the scratch space. Impala needs logic that allows disk space stealing but still enforces a minimum reserved disk capacity.
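
To make the "stealing with a minimum reservation" idea concrete, here is a minimal Python sketch of one possible policy. It is illustrative only and is not Impala code: the class name, field names, and the choice to give both the cache and the scratch space their own reserved minimum are assumptions made for this example. Sizes are plain integers (think GB).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedDiskBudget:
    """Hypothetical arbiter for disk space shared by the data cache and scratch."""
    total: int                # total capacity shared by both consumers
    scratch_reserved: int     # space the cache may never borrow (assumed policy)
    cache_reserved: int       # space scratch may never take from the cache (assumed)
    cache_used: int = 0
    scratch_used: int = 0

    def cache_limit(self) -> int:
        # The cache may use everything except what scratch holds or has reserved.
        return self.total - max(self.scratch_used, self.scratch_reserved)

    def try_grow_cache(self, n: int) -> bool:
        # Refuse cache growth that would eat into scratch usage or its reservation.
        if self.cache_used + n > self.cache_limit():
            return False
        self.cache_used += n
        return True

    def grow_scratch(self, n: int) -> Optional[int]:
        """Grant n units to scratch. Returns how much the cache had to evict,
        or None if the request would cut into the cache's reserved minimum."""
        new_scratch = self.scratch_used + n
        if new_scratch > self.total - self.cache_reserved:
            return None
        free = self.total - self.cache_used - self.scratch_used
        to_evict = max(0, n - free)   # borrowed space is returned via eviction
        self.cache_used -= to_evict
        self.scratch_used = new_scratch
        return to_evict

    def release_scratch(self, n: int) -> None:
        # Space freed by finished queries becomes borrowable by the cache again.
        self.scratch_used = max(0, self.scratch_used - n)
```

With total=400, scratch_reserved=50 and cache_reserved=100, the cache can grow to 350 while nothing is spilling. A later spill of 150 forces the cache back down to 250, and a request that would push scratch past 300 is refused.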

One issue with this approach is predictability. When queries start spilling, performance can be impacted because data will be evicted from the cache. In practice, this may not be a big deal, especially since the evicted data will be the least recently used data. However, we should still think through the tradeoff between predictability and performance / utilization, and think of ways to expose metrics indicating that spill-to-disk is taking space away from the data cache.
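
As a rough illustration of the metrics idea, the following Python sketch keeps separate counters for bytes evicted because the cache hit its own capacity and bytes evicted because space was reclaimed for spilling. The class, method, and counter names are hypothetical and do not correspond to existing Impala metrics. Entries are evicted in least-recently-used order.

```python
from collections import OrderedDict


class LruCacheWithMetrics:
    """Hypothetical LRU cache that records why bytes were evicted."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.entries = OrderedDict()          # key -> size, least recently used first
        self.used_bytes = 0
        self.bytes_evicted_for_capacity = 0   # hypothetical metric
        self.bytes_evicted_for_spill = 0      # hypothetical metric

    def _evict_until(self, limit: int, for_spill: bool) -> None:
        # Drop least recently used entries until usage fits under the limit.
        while self.used_bytes > limit and self.entries:
            _, size = self.entries.popitem(last=False)
            self.used_bytes -= size
            if for_spill:
                self.bytes_evicted_for_spill += size
            else:
                self.bytes_evicted_for_capacity += size

    def put(self, key: str, size: int) -> None:
        if key in self.entries:
            self.used_bytes -= self.entries.pop(key)
        self.entries[key] = size
        self.used_bytes += size
        self._evict_until(self.capacity_bytes, for_spill=False)

    def touch(self, key: str) -> bool:
        # A cache hit moves the entry to the most recently used position.
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        return False

    def reclaim_for_spill(self, nbytes: int) -> None:
        # Shrink the cache's capacity and attribute resulting evictions to spilling.
        self.capacity_bytes = max(0, self.capacity_bytes - nbytes)
        self._evict_until(self.capacity_bytes, for_spill=True)
```

A non-zero bytes_evicted_for_spill counter would tell an operator that spilling queries, rather than ordinary cache churn, are shrinking the cache.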



--
This message was sent by Atlassian Jira
(v8.3.4#803005)