Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/03/23 16:21:00 UTC

[jira] [Commented] (IMPALA-11886) Data cache should support asynchronous writes

    [ https://issues.apache.org/jira/browse/IMPALA-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704237#comment-17704237 ] 

ASF subversion and git services commented on IMPALA-11886:
----------------------------------------------------------

Commit 1cfd41e8b10f6e91fc79d50ab58e671d63b65eec in impala's branch refs/heads/master from Eyizoha
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1cfd41e8b ]

IMPALA-11886: Data cache should support asynchronous writes

This patch implements asynchronous writes to the data cache to improve
scan performance when a cache miss happens.
Previously, writes to the data cache were synchronous with HDFS file
reads, and both were handled by the remote HDFS IO threads. In other
words, on a cache miss the IO thread also had to perform the cache
write, which degraded scan performance.
This patch uses a thread pool for asynchronous writes; the number of
threads in the pool is set by the new configuration
'data_cache_num_write_threads'. In asynchronous write mode, the IO
thread only needs to copy the data into a temporary buffer when storing
it in the data cache. The additional memory consumed by these temporary
buffers can be limited through the new configuration
'data_cache_write_buffer_limit'.
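
The following is a minimal Python sketch of the idea described above, not
Impala's actual C++ implementation. The class AsyncCacheWriter, its methods,
and the choice to drop a write when the buffer budget is exhausted are all
assumptions made for illustration; the two constructor arguments only play
the roles of 'data_cache_num_write_threads' and
'data_cache_write_buffer_limit'.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class AsyncCacheWriter:
    """Toy model (hypothetical) of a cache whose writes run on a thread pool."""

    def __init__(self, num_write_threads, write_buffer_limit):
        # Assumed stand-ins for the two new configuration options.
        self._pool = ThreadPoolExecutor(max_workers=num_write_threads)
        self._limit = write_buffer_limit
        self._buffered = 0              # bytes currently held in temporary buffers
        self._lock = threading.Lock()   # guards _buffered
        self._store = {}                # stands in for the real cache storage

    def store(self, key, data):
        """Called by the IO thread; returns False if the write was skipped."""
        size = len(data)
        with self._lock:
            if self._buffered + size > self._limit:
                return False            # over the buffer budget: skip caching
            self._buffered += size
        buf = bytes(data)               # snapshot into a temporary buffer
        self._pool.submit(self._write, key, buf)
        return True                     # IO thread returns without waiting

    def _write(self, key, buf):
        # Runs on a pool thread: do the cache write, then release the budget.
        self._store[key] = buf
        with self._lock:
            self._buffered -= len(buf)

    def lookup(self, key):
        return self._store.get(key)

    def buffered_bytes(self):
        with self._lock:
            return self._buffered

    def close(self):
        # Wait for all pending writes to finish.
        self._pool.shutdown(wait=True)


writer = AsyncCacheWriter(num_write_threads=2, write_buffer_limit=8)
accepted = writer.store("small", bytearray(b"1234"))   # fits in the budget
rejected = writer.store("large", bytearray(b"x" * 16))  # exceeds the budget
writer.close()
print(accepted, rejected, writer.lookup("small"))
```

In this sketch the IO thread pays only for the copy and a short critical
section; a write that would exceed the budget is skipped, so the data is
simply not cached on that miss.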

Testing:
- Add test cases for asynchronous data writing to the original
  DataCacheTest using different numbers of threads.
- Add DataCacheTest.OutOfWriteBufferLimit, which tests the limit on
  memory consumed by temporary buffers during asynchronous writes.
- Add a timer to the MultiThreadedReadWrite function to get the average
  time of multithreaded writes. The test cases below are those whose
  write times differ significantly between synchronous and asynchronous
  mode:
Test case                | Policy | Sync/Async | Write time (ms)
MultiThreadedNoMisses    | LRU    | Sync       |   12.20
MultiThreadedNoMisses    | LRU    | Async      |   20.74
MultiThreadedNoMisses    | LIRS   | Sync       |    9.42
MultiThreadedNoMisses    | LIRS   | Async      |   16.75
MultiThreadedWithMisses  | LRU    | Sync       |  510.87
MultiThreadedWithMisses  | LRU    | Async      |   10.06
MultiThreadedWithMisses  | LIRS   | Sync       | 1872.11
MultiThreadedWithMisses  | LIRS   | Async      |   11.02
MultiPartitions          | LRU    | Sync       |    1.20
MultiPartitions          | LRU    | Async      |    5.23
MultiPartitions          | LIRS   | Sync       |    1.26
MultiPartitions          | LIRS   | Async      |    7.91
AccessTraceAnonymization | LRU    | Sync       | 1963.89
AccessTraceAnonymization | LRU    | Sync       | 2073.62
AccessTraceAnonymization | LRU    | Async      |    9.43
AccessTraceAnonymization | LRU    | Async      |   13.13
AccessTraceAnonymization | LIRS   | Sync       | 1663.93
AccessTraceAnonymization | LIRS   | Sync       | 1501.86
AccessTraceAnonymization | LIRS   | Async      |   12.83
AccessTraceAnonymization | LIRS   | Async      |   12.74
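
To make the table easier to read, the sync/async ratio per test case can be
computed as in the sketch below. The numbers are copied from the table;
averaging the duplicated AccessTraceAnonymization rows is an assumption, since
the table does not say what distinguishes them.

```python
from statistics import mean

# (test case, policy) -> (sync times in ms, async times in ms), from the table.
timings = {
    ("MultiThreadedNoMisses", "LRU"): ([12.20], [20.74]),
    ("MultiThreadedNoMisses", "LIRS"): ([9.42], [16.75]),
    ("MultiThreadedWithMisses", "LRU"): ([510.87], [10.06]),
    ("MultiThreadedWithMisses", "LIRS"): ([1872.11], [11.02]),
    ("MultiPartitions", "LRU"): ([1.20], [5.23]),
    ("MultiPartitions", "LIRS"): ([1.26], [7.91]),
    # Duplicate rows are averaged (assumption, see the note above).
    ("AccessTraceAnonymization", "LRU"): ([1963.89, 2073.62], [9.43, 13.13]),
    ("AccessTraceAnonymization", "LIRS"): ([1663.93, 1501.86], [12.83, 12.74]),
}


def sync_over_async(case):
    """Ratio > 1 means the asynchronous run had the lower write time."""
    sync_ms, async_ms = timings[case]
    return mean(sync_ms) / mean(async_ms)


for case in timings:
    print(f"{case[0]:<25} {case[1]:<5} {sync_over_async(case):8.2f}x")
```

A ratio below 1 (the NoMisses and MultiPartitions cases) means the
asynchronous run was slower in that test; a ratio above 1 means it was faster.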

Change-Id: I878f7486d485b6288de1a9145f49576b7155d312
Reviewed-on: http://gerrit.cloudera.org:8080/19475
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Data cache should support asynchronous writes
> ---------------------------------------------
>
>                 Key: IMPALA-11886
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11886
>             Project: IMPALA
>          Issue Type: Improvement
>    Affects Versions: Impala 4.3.0
>            Reporter: Ye Zihao
>            Assignee: Ye Zihao
>            Priority: Major
>
> Currently, writes to the data cache are synchronous with HDFS file reads, and both are handled by the remote HDFS IO threads. In other words, if a cache miss occurs, the IO thread also has to perform the cache write, which degrades query performance in some cases.
> Therefore, the data cache should be able to defer writes to another thread (or thread pool) that writes asynchronously, allowing the IO thread to copy the data into a temporary buffer and immediately return it to the scanner. The extra memory consumed by the temporary buffers also needs to be bounded.


