Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/06/07 20:05:00 UTC
[jira] [Commented] (IMPALA-11904) Data cache should support dumping metadata for reloading

    [ https://issues.apache.org/jira/browse/IMPALA-11904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730266#comment-17730266 ] 

ASF subversion and git services commented on IMPALA-11904:
----------------------------------------------------------

Commit c209b50867d846029b8f149ecbd0187d6eae9455 in impala's branch refs/heads/master from Eyizoha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c209b5086 ]

IMPALA-11904: Data cache support dumping for reloading

The data cache mainly consists of cache metadata and cache files. The
cache files are located on disk and are responsible for storing the
cached data content, while the cache metadata is located in memory and
is responsible for indexing into the cache files by cache key.
Before this patch, if the impalad process exited, the cache metadata was
lost. After the impalad process restarted, the cache files could not be
reused even though they were still on disk, because there was no
corresponding cache metadata to index them.
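
The split between on-disk cache files and an in-memory index can be
illustrated with a minimal toy sketch. It is not Impala's implementation:
the class names, the key fields and the bytearray standing in for a cache
file are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CacheKey:
    # Hypothetical key fields; the real cache key layout may differ.
    filename: str
    mtime: int
    offset: int


@dataclass
class CacheEntry:
    file_offset: int  # where the cached bytes start in the cache file
    length: int       # how many bytes were cached


class ToyDataCache:
    """In-memory index over a byte buffer that stands in for a cache file."""

    def __init__(self, cache_file: bytearray) -> None:
        self.cache_file = cache_file
        self.index: Dict[CacheKey, CacheEntry] = {}

    def store(self, key: CacheKey, data: bytes) -> None:
        # Append the data to the "file" and remember where it landed.
        self.index[key] = CacheEntry(len(self.cache_file), len(data))
        self.cache_file.extend(data)

    def lookup(self, key: CacheKey) -> Optional[bytes]:
        entry = self.index.get(key)
        if entry is None:
            return None  # no metadata: the bytes on disk are unreachable
        end = entry.file_offset + entry.length
        return bytes(self.cache_file[entry.file_offset:end])


disk = bytearray()  # survives the simulated restart
cache = ToyDataCache(disk)
key = CacheKey("/warehouse/t/part-0.parq", 1686168300, 0)
cache.store(key, b"column-chunk")

# Simulated restart: same bytes on "disk", but a fresh, empty index.
restarted = ToyDataCache(disk)
```

In this sketch, cache.lookup(key) returns the cached bytes, while
restarted.lookup(key) returns None even though the bytes are still in the
buffer, which is the situation described above.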

This patch implements dump and load functions for the data cache.
When the feature is enabled by setting
'data_cache_keep_across_restarts=true' and the impalad process is
stopped by graceful shutdown (kill -SIGRTMIN $pid), the data cache
collects the cache metadata and dumps it to the location of the cache
directory. When the impalad process restarts, it tries to load the
dumped files from disk to restore the original cache metadata, so that
the existing cache files can be reused without refilling the cache.
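
As a rough illustration of the dump and load round trip, the following
sketch writes an index to a file in the cache directory and reads it back.
The file name, the JSON format and both function names are hypothetical
choices for this example and say nothing about the on-disk format the patch
actually uses.

```python
import json
import os
import tempfile
from typing import Dict, Tuple

DUMP_NAME = "metadata.dump"  # hypothetical file name, not Impala's

Index = Dict[str, Tuple[int, int]]  # key -> (file offset, length)


def dump_metadata(index: Index, cache_dir: str) -> str:
    """Write the index next to the cache files and return the dump path."""
    path = os.path.join(cache_dir, DUMP_NAME)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump({k: list(v) for k, v in index.items()}, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename last so a crash mid-write never leaves a partial dump in place.
    os.replace(tmp_path, path)
    return path


def load_metadata(cache_dir: str) -> Index:
    """Restore the index; fall back to an empty one if no usable dump exists."""
    path = os.path.join(cache_dir, DUMP_NAME)
    try:
        with open(path, encoding="utf-8") as f:
            raw = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return {k: (int(v[0]), int(v[1])) for k, v in raw.items()}


with tempfile.TemporaryDirectory() as cache_dir:
    before = {"part-0.parq@0": (0, 4096), "part-0.parq@4096": (4096, 1024)}
    dump_metadata(before, cache_dir)                           # at shutdown
    after = load_metadata(cache_dir)                           # at restart
    empty = load_metadata(os.path.join(cache_dir, "missing"))  # no dump found
```

Falling back to an empty index when the dump is missing or unreadable is a
design choice of this sketch: the cache then simply starts cold, as it did
before the patch.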

The cache can be safely dumped during query execution because, before
the dump starts, the data cache is set to read-only to prevent
inconsistency between the metadata dump and the cache files. Note that
the dump files also use disk space. In testing, the size of the dump
files was generally no more than 0.5% of the total size of the cache
files.
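
The read-only step can be sketched as a flag checked under the same lock
that guards the index. This is a simplified model under assumed names
(ReadOnlyGatedCache, set_read_only, dump); the real data cache is more
involved.

```python
import threading
from typing import Dict, Optional


class ReadOnlyGatedCache:
    """Toy cache whose writes are rejected while it is read-only."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._read_only = False
        self._index: Dict[str, bytes] = {}

    def set_read_only(self, value: bool) -> None:
        with self._lock:
            self._read_only = value

    def store(self, key: str, data: bytes) -> bool:
        with self._lock:
            if self._read_only:
                return False  # skip the insert; the query itself still runs
            self._index[key] = data
            return True

    def lookup(self, key: str) -> Optional[bytes]:
        with self._lock:
            return self._index.get(key)

    def dump(self) -> Dict[str, bytes]:
        # Block new writes first, then copy the index, so the snapshot
        # cannot go stale relative to the cached data.
        self.set_read_only(True)
        with self._lock:
            return dict(self._index)


cache = ReadOnlyGatedCache()
stored_before = cache.store("a", b"1")
snapshot = cache.dump()
stored_during = cache.store("b", b"2")  # rejected: cache is read-only
hit_during = cache.lookup("a")          # reads keep working
cache.set_read_only(False)              # revoke read-only
stored_after = cache.store("b", b"2")
```

Queries running during the dump still get cache hits; they only lose the
ability to insert new entries. On the size figure, 0.5% of 1 TB of cache
files would be about 5 GB of dump files.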

Testing:
- Add DataCacheTest,#SetReadOnly
Used to test whether set/revoke read-only takes effect, even when there
are writes in progress.
- Add DataCacheTest,#DumpAndLoad
Used to test whether the original cache contents can be read after a
data cache dump and reload.
- Add DataCacheTest,#ChangeConfBeforeLoad
Used to test whether the original cache contents can be read after the
data cache is dumped and the configuration is changed and then reloaded.
- Add end-to-end test in test_data_cache.py
Perform end-to-end testing in a custom cluster, including executing
queries, gracefully restarting, verifying metrics, re-executing the same
query and verifying hits/misses. This also includes testing the
modification of cache capacity and restart, as well as testing restarts
while queries are in progress.

Change-Id: Id867f4fc7343898e4906332c3caa40eb57a03101
Reviewed-on: http://gerrit.cloudera.org:8080/19532
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Joe McDonnell <jo...@cloudera.com>


> Data cache should support dumping metadata for reloading
> --------------------------------------------------------
>
>                 Key: IMPALA-11904
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11904
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.3.0
>            Reporter: Ye Zihao
>            Assignee: Ye Zihao
>            Priority: Major
>
> The data cache mainly consists of cache metadata and cache files. The cache files are located on disk and are responsible for storing the cached data content, while the cache metadata is located in memory and is responsible for indexing into the cache files by cache key.
> Currently, if the impalad process exits, the cache metadata is lost. After the impalad process restarts, we cannot reuse the cache files even though they are still on disk, because there is no corresponding cache metadata to index them.
> If we can support dumping the cache metadata to disk when the process exits, then the next time the process starts it can be reloaded back into memory and the previous cache files can be reused. This would be helpful in a real production environment, where cache data often exceeds TB in size (per process), and loss of cache data due to a configuration change or version upgrade can take days to recover.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
