Posted to reviews@impala.apache.org by "Michael Ho (Code Review)" <ge...@cloudera.org> on 2019/04/10 17:12:22 UTC

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Michael Ho has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12987


Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
<directory>:<quota> pairs. An example configuration string:
  --data_cache_config=/data/0:150GB,/data/1:150GB

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum).
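
A minimal sketch of those two tuples (illustrative only; these field
names are assumptions, not the patch's actual types):

  #include <cstdint>
  #include <string>

  struct CacheKey {
    std::string filename;  // remote file's name
    int64_t mtime;         // remote file's modification time
    int64_t offset;        // offset of the cached range in the remote file
  };

  struct CacheEntry {
    int backing_file;   // which backing file on local storage holds the data
    int64_t offset;     // offset of the data within that backing file
    int64_t len;        // length of the cached data
    uint64_t checksum;  // optional checksum of the cached data
  };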

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.
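
Hole punching here refers to fallocate(2) with FALLOC_FL_PUNCH_HOLE. A
minimal sketch of reclaiming one evicted entry's storage (Linux-specific,
error handling simplified):

  #include <cerrno>
  #include <fcntl.h>  // fallocate(), FALLOC_FL_* (needs _GNU_SOURCE)

  // Frees the blocks behind an evicted entry without shrinking the file.
  // FALLOC_FL_KEEP_SIZE preserves the logical file size so new data can
  // still be appended at the end; the freed range leaves the file sparse.
  static bool PunchHole(int fd, off_t offset, off_t len) {
    int ret;
    do {
      ret = fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      offset, len);
    } while (ret != 0 && errno == EINTR);
    return ret == 0;
  }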

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted and that multiple
attempted insertions with the same cache key have the same content.

To probe for cached data, the interface Lookup() is used; to insert
data into the cache, the interface Store() is used. Please note that
eviction currently happens inline during Store().
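
The exact signatures aren't spelled out here; a hypothetical caller on
the read path might look like the following (names and signatures are
assumptions, not the patch's actual API):

  // On a hit, Lookup() is assumed to copy the cached bytes into 'buffer'
  // and return how many bytes were found; on a miss, the result of the
  // remote read is offered back to the cache via Store().
  int64_t hit_len = cache->Lookup(filename, mtime, offset, len, buffer);
  if (hit_len < len) {
    ReadFromRemoteFs(filename, offset, len, buffer);  // e.g. via hdfsPread()
    // Store() may evict LRU entries inline to make room for the new data.
    cache->Store(filename, mtime, offset, buffer, len);
  }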

Testing done: a new BE test was added; core tests were run with the cache enabled.

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M common/thrift/metrics.json
14 files changed, 1,115 insertions(+), 9 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space in total, with a 150GB cap on each of /data/0 and /data/1.
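
A sketch of how such a flag could be parsed (illustrative only; error
handling is omitted and the patch's actual parsing code may differ):

  #include <string>
  #include <vector>

  struct DataCacheConfig {
    std::vector<std::string> dirs;  // one cache partition per directory
    std::string capacity;           // capacity string applied to each dir
  };

  // "/data/0,/data/1:150GB" -> dirs {"/data/0", "/data/1"}, capacity
  // "150GB", i.e. a 150GB quota per directory and 300GB in total.
  DataCacheConfig ParseDataCacheConfig(const std::string& flag) {
    DataCacheConfig config;
    size_t colon = flag.rfind(':');
    config.capacity = flag.substr(colon + 1);
    std::string dir_list = flag.substr(0, colon);
    size_t start = 0;
    while (start <= dir_list.size()) {
      size_t comma = dir_list.find(',', start);
      if (comma == std::string::npos) comma = dir_list.size();
      config.dirs.push_back(dir_list.substr(start, comma - start));
      start = comma + 1;
    }
    return config;
  }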

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file's range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this be a problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.
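
A sketch of that rotation policy on the insertion path (helper names are
made up; the flag names follow the ones mentioned in this review):

  // Inside Store(), before appending 'len' bytes to the partition's
  // current backing file (locking omitted):
  if (current_file->offset + len > FLAGS_data_cache_file_max_size_bytes) {
    CreateNewBackingFile();  // 'current_file' now points at the new file
  }
  if (cache_files_.size() > FLAGS_data_cache_max_files_per_partition) {
    // Delete the oldest file; entries still pointing at it become stale
    // and are erased lazily on lookup or aged out of the LRU cache.
    DeleteOldestBackingFile();
  }
  AppendToCurrentFile(buffer, len);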

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted and that multiple
attempted insertions with the same cache key have the same content.
Checksumming is enabled by default for debug builds.
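
The hash function isn't named in this message; as a sketch of the
protocol, any fast 64-bit hash works (FNV-1a below is purely
illustrative):

  #include <cstddef>
  #include <cstdint>

  // Illustrative 64-bit FNV-1a; the patch's actual hash may differ.
  static uint64_t Checksum(const uint8_t* data, size_t len) {
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (size_t i = 0; i < len; ++i) h = (h ^ data[i]) * 1099511628211ULL;
    return h;
  }

  // On Store(): record Checksum(buffer, len) in the cache entry; a repeated
  // Store() under the same key with a different checksum indicates
  // inconsistent content. On Lookup(): recompute the checksum over the
  // bytes read from the backing file and treat any mismatch as a miss.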

To probe for cached data, the interface Lookup() is used; to insert
data into the cache, the interface Store() is used. Please note that
eviction currently happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.
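
For example (the directory and size below are illustrative):

  bin/start-impala-cluster.py --data_cache_dir=/tmp --data_cache_size=500MB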

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with an HDFS cluster using EBS as
the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/10
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 10
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4146/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 11
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 02:49:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3031/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 02 May 2019 03:07:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

> I think the _IO_ apportioned relative to capacity may be fine as that's a reasonable expectation if a user specifies more capacity on a slower device. One thing to consider may be to use multiple backing files for larger partitions to avoid the per-file lock problem.

I think, unfortunately, the lower-capacity device is most likely to be the faster one (e.g. a small SSD plus big HDDs).

Maybe we can simplify this for now by just requiring that all partitions of the cache be allocated the same amount of space? The most immediate scenario I can see is on Amazon instances like r5d.4xlarge (two local SSDs with the same size) where you'd want to give the same amount of capacity to each cache drive anyway.

Put another way, I think we should constrain the configuration space in such a way that only good configs are possible, rather than hope that users don't do something like: /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB and find that their SSD's fast IO performance is basically ignored.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the process exits.
> Thanks for the feedback. I also considered the approach of cleaning up the 
In terms of preventing the "wiping" of a directory while a previous Impala is still running, we can always use advisory locks. kudu::Env::Default()->LockFile() can do this for you pretty easily.

One question I don't know the answer to: there might be a performance advantage to operating on a deleted inode. It may be that XFS or ext4 has optimizations that kick in when a file is unlinked. For example, normally every write to a file will update the file's mtime, which requires writing to the filesystem journal, etc. (I've often seen stacks blocked in file_update_time() waiting on a jbd2 lock in the kernel under heavy IO). I'm not sure if this is optimized or not, but you could certainly imagine that, for an unlinked file, journaling would be skipped, etc., since the goal is that on a crash we don't need to restore that file. It may also be that this optimization isn't really implemented in practice :) I looked through the kernel source for a bit to see if I could find such an optimization, but none was obviously present.
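
A minimal sketch of that advisory-lock guard, assuming Kudu's Env API
(the lock-file name below is made up):

  #include "kudu/util/env.h"

  // Held for the lifetime of the process; a second impalad pointed at the
  // same directory fails to acquire the lock and can refuse to wipe or
  // reuse the cache files.
  kudu::FileLock* lock = nullptr;
  kudu::Status s = kudu::Env::Default()->LockFile(
      cache_dir + "/impala-cache.lock", &lock);
  if (!s.ok()) {
    // Another process likely owns this directory; fail startup instead of
    // deleting its files.
  }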



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 12 Apr 2019 16:00:21 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "David Rorke (Code Review)" <ge...@cloudera.org>.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the process exits.
> I'm not sure how I feel about this - it could cause a lot of confusion havi
Agree and I also think there are some legitimate reasons that people might need to be able to access the files by name including capturing the file for debugging in support cases.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 10 Apr 2019 21:02:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274
PS7, Line 274:       ScopedFlagSetter<int64_t>::Make(&FLAGS_data_cache_file_max_size_bytes, 1024 * 1024);
I think gflags has a built-in 'google::FlagSaver', no? It handles restoring all modified flags upon scope exit. So you could just do:

google::FlagSaver s;
FLAGS_data_cache_file_max_size_bytes = 1024 * 1024;

(in Kudu we do the FlagSaver thing in our base test class; not sure if there is something equivalent in Impala)


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245
PS7, Line 245:   CacheFile(std::string path) : path_(move(path)) { }
nit: explicit


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301
PS7, Line 301:   DCHECK(partition_lock.owns_lock());
here I think you could just call lock_.DCheckLocked()  right?


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409
PS7, Line 409:   UnpackCacheEntry(meta_cache_->Value(handle), &entry);
why not have UnpackCacheEntry return the CacheEntry object? (the generated code would be equivalent due to return value optimization) Or make a constructor for CacheEntry which takes a Slice directly?



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 01 May 2019 22:47:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very
> I think with our current assumption of 4TB of logical space per file, if yo
Oh I didn't realise the files were that large, ok, that makes sense then.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 25 Apr 2019 18:10:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 10:32:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#4).

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space in total, with a 150GB cap on each of /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file's range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this be a problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted and that multiple
attempted insertions with the same cache key have the same content.
Checksumming is enabled by default for debug builds.

To probe for cached data, the interface Lookup() is used; to insert
data into the cache, the interface Store() is used. Please note that
eviction currently happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done: a new BE test was added; core tests were run with the cache enabled.

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with an HDFS cluster using EBS as
the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
18 files changed, 1,706 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/4
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

TODO:

- add a test to rotate the files in data-cache-test.cc
- add a test for the changes in filesystem-util.cc


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:46:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 08:06:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 9: Code-Review+2

(1 comment)

just simple nit, otherwise +2

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260
PS9, Line 260:   CacheEntry(const Slice& value) {
explicit :)



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 02 May 2019 17:58:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 12: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 19:39:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 5:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273
PS5, Line 273:   FLAGS_data_cache_file_max_size = 1024 * 1024;
I just found out we have ScopedFlagSetter in scoped-flag-setter.h; I think it fits here and in the other tests.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215
PS5, Line 215: too_many_files
'start_reclaim'?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   std::unique_ptr<ThreadPool<int>> file_deleter_pool_;
Can you mention in the comment that the pool has only 1 thread and why you're using a pool? I think it's because the pool makes handling the thread's lifetime easier, but I'm not sure that's correct.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341
PS5, Line 341:   void CloseOldFiles(uint32_t thread_id, int partition_idx);
Some functions around deleting files are called "Close...". We should point out in the comments somewhere that closing now also deletes. We could also rename the thread pool to file_closing_pool or rename the methods to "DeleteOldFiles" for consistency. I think I prefer the latter, since deletion implies closing, but the contraposition is not obvious.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72:     "(Advanced) Enable checksumming for the cached buffer.");
> This is actually a static class member of DataCache.
Sry for missing that.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: inline
> Not sure which one you are referring to ? Isn't it in #include "common/name
Yeah, I think we commonly omit the explicit include for vector


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95
PS5, Line 95: file deleter thread
switch to single thread, or mention pool here


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112
PS5, Line 112: RetireFile
Can we call this DeleteFile? Otherwise there's a third thing to keep track of (Close, Delete, Retire) and the differences are subtle. I feel it's clear enough that DeleteFile would make sure it's closed.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125
PS5, Line 125: percpu_rwlock
It's not obvious to me why we only need a percpu_rwlock here. Can you add a comment?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208
PS5, Line 208: holes
nit: singular


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335
PS5, Line 335: CloseAndVerifyFileSizes
Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I think that captures well what's going on and the caller can expect the files to get closed. I don't feel strongly about that one though.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395
PS5, Line 395:     meta_cache_->Erase(key);
Will this handle hole punching through the eviction logic?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436
PS5, Line 436:   VLOG(2) << Substitute("Storing file $0 offset $1 len $2 checksum $3 ",
nit: only append the "checksum $3" part if checksumming is enabled? I don't feel strongly about it though.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457
PS5, Line 457: too_many_files
start_reclaim?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633
PS5, Line 633: too_many_files
start_reclaim?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc@37
PS5, Line 37:  
nit: trailing space


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23
PS5, Line 23: cache hit and miss counts
            :   in the runtime profile are as expected.
It actually seems to check the metrics, not the profile counters.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 29 Apr 2019 21:39:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

(31 comments)

Mostly looks good to me. Please feel free to defer some of the comments to a subsequent change to make faster progress with this one.

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365
PS4, Line 365:       "DataCacheHitCount", TUnit::UNIT);
Do we have tests for these to make sure they show up in the profiles and work as expected?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330:   for (int i = 0; i < NUM_THREADS; ++i) {
> Prefer to keep this test as stand-alone to make development easier.
wfm


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138
PS4, Line 138:   const char* fname_ = "foobar";
Move to the other test constants (TEMP_BUFFER_SIZE etc)?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226
PS4, Line 226:   const int64_t cache_size = 4 * 1024 * 1024;
nit: This could now be 4 * FLAGS_data_cache_file_max_size ?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349
PS4, Line 349: This second part
Can they be to separate tests?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410
PS4, Line 410: beyone
nit: typo


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83
PS4, Line 83: concurrency of one thread
That's controlled by --data_cache_write_concurrency, right? Mention here?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87
PS4, Line 87: /// Future work:
Do we plan to look into picking partitions on faster disks with higher probability?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very
Should we just delete old files if we have reached a configurable high number, 1000 or so, and take the resulting cache hits instead of compacting?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145
PS4, Line 145:  
nit: formatting


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197
PS4, Line 197:     /// to be called in a single threaded environment after all IO threads exited.
Should we also delete the files when we close them? There's a distinction in the implementation and it's not clear to me why it's there.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203
PS4, Line 203: const kudu::Slice& key
Should we pass const CacheKey& here and convert it in the implementation?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224
PS4, Line 224: VerifyFilesSizes
nit: VerifyFileSizes


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72: const char* DataCache::Partition::CACHE_FILE_PREFIX = "impala-cache-file-";
static const char*?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75
PS4, Line 75: struct DataCache::CacheFile {
Should this be a class, given it has a c'tor and d'tor?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173
PS4, Line 173:  make_unique<CacheFile>(path, fd);
             :   cache_files_.emplace_back(std::move
I think you can merge these two lines, which also reduces the risk that someone accidentally uses the now empty cache_file in a future change.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: vector
nit: missing include, but we might generally omit this one.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@186
PS4, Line 186: // Delete all existing backing files left over from previous runs.
             :   vector<string> entries;
             :   RETURN_IF_ERROR(FileSystemUtil::Directory::GetEntryNames(path_, &entries, 0,
             :       FileSystemUtil::Directory::EntryType::DIR_ENTRY_REG));
             :   for (const string& entry : entries) {
             :     if (entry.find_first_of(CACHE_FILE_PREFIX) == 0) {
             :       const string file_path = JoinPathSegments(path_, entry);
             :       int res;
             :       RETRY_ON_EINTR(res, unlink(file_path.c_str()));
             :       LOG(INFO) << Substitute("$0 old cache file $1 $2",
             :           res == 0 ? "Deleted" : "Failed to delete", file_path,
             :           res == 0 ? "" : GetStrErrMsg());
             :     }
             :   }
This might be a good candidate for a separate method, but I don't feel strongly about it


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@235
PS4, Line 235: total_size
Wrap in PrintBytes, or just put "bytes" into the message


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@316
PS4, Line 316:       meta_cache_->Free(pending_handle);
nit: can be on a single line


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@391
PS4, Line 391:     while (cache_file->current_offset + charge_len > FLAGS_data_cache_file_max_size &&
What are the cases where this loop runs more than once? Would a single if-stmt be sufficient?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@402
PS4, Line 402:   // Set up a scoped exit to always remove entry from the pending insertion set.
Does this need to go to line 387 before the return statement in the file loop?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@459
PS4, Line 459: 0
Is a capacity of 1 really supported? Should we require something that technically can work (1 page) or even something reasonable (1M?)?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@482
PS4, Line 482:   if (partitions_.empty()) return 0;
Is this a valid thing to happen or should we DCHECK?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/disk-io-mgr.cc@61
PS4, Line 61: should
"must have"? Such a configuration is not supported anyways, but if people try, the daemons might just wipe each others cache files on startup.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h
File be/src/runtime/io/hdfs-file-reader.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h@48
PS1, Line 48:   /// Please note that this interface is only effective for local HDFS reads as it
> This is a common path shared with local-file-reader.h too. Not sure if it m
I agree. ReadCachedFile would be another alternative. It seems unconventional to have a method read data and not include "Read" somewhere in its name. I don't feel strongly though and am ok with keeping it unchanged


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/hdfs-file-reader.cc@153
PS4, Line 153:         VLOG(2) << "Reopening file " << scan_range_->file_string()
VLOG_FILE?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util-test.cc
File be/src/util/filesystem-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util-test.cc@147
PS4, Line 147: directory
file type entries?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util.cc
File be/src/util/filesystem-util.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util.cc@339
PS4, Line 339:   RETRY_ON_EINTR(r, fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
Kudu's RWFile encapsulates this. Should we use it instead?


http://gerrit.cloudera.org:8080/#/c/12987/4/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/4/bin/start-impala-cluster.py@118
PS4, Line 118: None
"_size" implies that the default should be something like 0?


http://gerrit.cloudera.org:8080/#/c/12987/4/common/thrift/metrics.json
File common/thrift/metrics.json:

http://gerrit.cloudera.org:8080/#/c/12987/4/common/thrift/metrics.json@433
PS4, Line 433:     "description": "Total number of insertion skipped in remote data cache due to concurrency limit.",
Do we have a test for the new metrics to make sure they work as expected?



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 23 Apr 2019 20:48:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 08:06:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343
PS2, Line 343: insertion_offset));
insertion_offset + bytes_written



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 17 Apr 2019 18:35:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/2802/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 16 Apr 2019 20:26:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2975/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 29 Apr 2019 19:28:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2842/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 19 Apr 2019 06:10:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88
PS2, Line 88: consolidatin
> consolidating
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89
PS2, Line 89: sparse
> sparse
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@166
PS2, Line 166: 
             :   /// Utility function to verify that all partitions' consumption don't exceed their
             :   /// quotas. Retur
> clang-tidy failure: struct instead of class.
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184
PS2, Line 184: 
> empty
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296
PS2, Line 296:       meta_cache_->Erase(key);
             :       return true;
             :     }
             :   }
> TODO: Add a metric for this.
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343
PS2, Line 343: 
> insertion_offset + bytes_written
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117
PS2, Line 117: 
> flake8: E703 statement ends with a semicolon
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120
PS2, Line 120: 
> flake8: E703 statement ends with a semicolon
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 19 Apr 2019 02:20:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

(3 comments)

I don't know that I'll have time to do a fully detailed review, and it looks like we may already have enough eyes on it. The design and APIs look great - it does seem like we want to get it checked in and exercised more soon.

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: Testing done: a new BE test was added; core test with cache enabled.
Can we add a custom cluster test to sanity check that it works end to end? Maybe this is where we should sanity check metrics too.


http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62
PS4, Line 62: 
Do we want to consider enabling this by default for the dockerised tests as a next step?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very
> Should we just delete old files if we have reached a configurable high numb
Yeah, this scenario is still a bit concerning for me, since it's conceivable that the workloads we test with might have different properties from real ones. E.g. I can imagine a workload might end up with some set of files that are queried frequently enough to stay resident in the cache indefinitely and keep very old files alive. Not sure if Lars' idea is easier than compaction but it is appealing in some ways.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 23 Apr 2019 23:38:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 9: Code-Review+1

Carry Lars' +1


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 02 May 2019 02:24:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

After sleeping on this patch, had a few more higher-level thoughts:

1- we should probably periodically check disk space on the allocated partition and if there is a limited amount remaining we should stop using that partition for cache. In the future maybe we would dynamically evict stuff if we're about to run out of space but I think a simple "stop adding to the problem" is a good starting point. We should also handle ENOSPC gracefully - not sure if we do today.

2- currently we hash the cache keys to pick a partition, and then each partition is separately LRU. That should work OK if the partitions are allocated equal sizes. In the case that they aren't, however, this will have a negative effect on hit ratio. For example, consider the limit where one partition has only 10MB of capacity whereas a second partition has 1TB. If we assume that 10MB of cache provides a very low hit rate, then half of our cache queries will hit the low-hit-rate cache and we'll be limited at close to 50% hit rate for the combined cache.

A few different solutions come to mind: (a) instead of hashmod to pick a partition, we could use the hash code to place a token in a hash space sized based on capacity. E.g. in the above example, the first partition owns hash values 0-1MB and the second partition owns hash values 1MB-1.001TB. Upon hashing a key, we mod by the total capacity and pick a partition with weight relative to capacity.
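
For illustration, a minimal sketch of that capacity-weighted placement; the Partition layout and names are hypothetical, not from the patch:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical partition descriptor. Each partition owns a slice of the
// hash space proportional to its capacity, so a tiny partition next to a
// huge one receives a proportionally tiny share of keys instead of 50%.
struct Partition { int64_t capacity_bytes; };

int PickPartition(uint64_t key_hash, const std::vector<Partition>& parts) {
  if (parts.empty()) return -1;
  int64_t total = 0;
  for (const Partition& p : parts) total += p.capacity_bytes;
  // Map the hash onto [0, total) and walk the cumulative capacities.
  int64_t point = static_cast<int64_t>(key_hash % static_cast<uint64_t>(total));
  int64_t cumulative = 0;
  for (size_t i = 0; i < parts.size(); ++i) {
    cumulative += parts[i].capacity_bytes;
    if (point < cumulative) return static_cast<int>(i);
  }
  return static_cast<int>(parts.size()) - 1;  // Unreachable when total > 0.
}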

One downside of this approach is that the _IO_ usage will end up apportioned relative to capacity. That may be especially bad considering smaller devices are likely to be faster (a well-meaning user might allocate /data/ssd:100GB,/data/hdd:1TB but now they're getting 10x the IO pushed on their HDD instead of the SSD).

Another option would be to do a global hashtable and have the entries directly identify their partitions (in fact I think they already do via the CacheFile pointer, right?). Eviction still needs to be per-partition based on that partition's capacity, but insertion can try to more smartly allocate across drives.


Last thought: since there are some tricky design decisions above, maybe we can simplify this and just support a single partition for now, and come back to supporting multi-partition?


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 11 Apr 2019 16:20:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 6:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112
PS6, Line 112:   /// 'config' is the configuration string which specifies a list of <directory>:<quota>
             :   /// tuples, delimited by comma.
per the commit message, we've moved to a single quota rather than per-directory quotas, right? or is that a typo in the commit message?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261
PS6, Line 261:     int oldest_opened_file_ = 0;
perhaps init to -1?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64
PS6, Line 64: DEFINE_int64(data_cache_file_max_size, 4L << 40,
- can you add a comment here like /* 4TB */?
- have we tested that 4TB actually works in a long-running cluster? Now that you have the deletion support in, maybe 1TB is a safer default if we're not sure about full FS support?
- can you rename to _max_size_bytes?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68
PS6, Line 68:     "(Advanced) The maximum number of allowed opened files per partition.");
Setting this per-partition creates a dependency between this and the number of partitions. I think it would be better to have this be a total, and then auto-set the per-partition limit by dividing the capacity among the partitions. Otherwise it's likely people will have to set this to keep the fd limit in check, right? Or do we generally assume that ulimit -n is boosted super high for impala?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70
PS6, Line 70:     "(Advanced) Number of concurrent threads allowed to insert into the cache");
is this per-partition? should be, right?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@117
PS6, Line 117:     KUDU_RETURN_IF_ERROR(kudu::Env::Default()->NewRWFile(path, &cache_file->file_),
why not pass the RWFile into the CacheFile constructor vs creating an empty one and calling NewRWFile here?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@141
PS6, Line 141:     kudu::Status status = kudu::Env::Default()->DeleteFile(path_);
WARN_NOT_OK could be used here

(I think WARN is more appropriate than ERROR since no data is lost, etc.)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@153
PS6, Line 153: inline
'inline' here and elsewhere isn't necessary since you've defined them inline inside the class anyway


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@174
PS6, Line 174:     if (UNLIKELY(!file_)) return false;
worth a DCHECK that offset + bytes_to_read <= current_offset_


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@192
PS6, Line 192:     kudu::Status status = file_->Write(offset, Slice(buffer, buffer_len));
same DCHECK suggested above


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@205
PS6, Line 205:     if (UNLIKELY(!file_)) return;
same


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@228
PS6, Line 228: used for synchronization
instead of just saying "used for synchronization", I think it's best to say "taken in write mode during deletion, and in shared mode everywhere else"


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@281
PS6, Line 281: Status DataCache::Partition::CreateCacheFile() {
can you DCHECK that lock_ is held by the current thread here? (same elsewhere in functions that require the lock to be held on entry)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@347
PS6, Line 347:     KUDU_RETURN_IF_ERROR(env->GetFileSizeOnDisk(file->path(), &logical_sz),
             :         "CloseAndVerifyFileSizes()");
is this the right method call? seems the same as above


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@352
PS6, Line 352: resize(0);
.clear()


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@360
PS6, Line 360: void DataCache::Partition::Close() {
dcheck the lock is held?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@380
PS6, Line 380:   const CacheEntry* entry = reinterpret_cast<const CacheEntry*>(value_slice.data());
I think this pattern is used pretty widely in Impala but it's moderately sketchy, since it assumes that value_slice's allocation has the same alignment requirements as CacheEntry. That's likely but not really guaranteed by anything, and we've had crashes in the past in Kudu with similar usage. I'd suggest:

CacheEntry entry;
memcpy(&entry, value_slice.data(), sizeof(entry));

just to be safe (the memcpy of 40 bytes or whatever is probably free, and could be optimized out by the compiler generating unaligned loads)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@395
PS6, Line 395:     meta_cache_->Erase(key);
should LOG(WARNING) an error here on bad checksum, so people know their drive might be acting bad


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@436
PS6, Line 436:   VLOG(2) << Substitute("Storing file $0 offset $1 len $2 checksum $3 ",
maybe VLOG(3)?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@444
PS6, Line 444:       reinterpret_cast<CacheEntry*>(meta_cache_->MutableValue(pending_handle));
similar to comment above, I think it's better to create a struct, set the fields, and memcpy into the value in the cache
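
For illustration, a minimal sketch of the copy-in/copy-out pattern suggested here and in the comment on line 380 above; the CacheEntry fields are illustrative, not the patch's actual layout:

#include <cstdint>
#include <cstring>

// Illustrative entry layout; the real struct in the patch may differ.
struct CacheEntry {
  const void* file;  // Stand-in for the CacheFile pointer.
  int64_t offset;
  int64_t len;
  uint64_t checksum;
};

// Copy out: avoids reinterpret_cast'ing possibly misaligned cache memory.
CacheEntry ReadEntry(const uint8_t* value_data) {
  CacheEntry entry;
  memcpy(&entry, value_data, sizeof(entry));
  return entry;
}

// Copy in: build the struct locally, then memcpy it into the cache value.
void WriteEntry(uint8_t* value_data, const CacheEntry& entry) {
  memcpy(value_data, &entry, sizeof(entry));
}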


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@508
PS6, Line 508: insert
nit: emplace


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@588
PS6, Line 588: DCHECK_GT
CHECK here (not a hot path and we'd crash later)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@616
PS6, Line 616:   if (VLOG_IS_ON(2)) {
3 or 4 maybe? this will be quite noisy


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/util/filesystem-util.cc
File be/src/util/filesystem-util.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/util/filesystem-util.cc@305
PS6, Line 305: Status FileSystemUtil::CheckHolePunch(const string& path) {
we should probably test for known buggy el6 versions and ext4 here and disallow hole punching usage (see Kudu source for the code for this). Otherwise using the cache can cause file systems to need fsck repair on startup.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 29 Apr 2019 23:08:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string which is a list of directories, separated
by comma, followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with 150GB max for /data/0 and /data/1 respectively.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify read from the cache
is consistent with what was inserted and to verify that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates the caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with a configuration of an HDFS cluster
using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/8
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h
File be/src/runtime/io/request-context.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h@161
PS1, Line 161:   int data_cache_partial_hit_count() const { return data_cache_partial_hit_count_.Load(); }
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Apr 2019 17:13:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@147
PS1, Line 147:   if (charge_len > capacity_) return false;
> Given that we won't actually insert into the cache until after trying to wr
This actually seems to be a problematic behavior. We may temporarily exceed the capacity limit as a result of this.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 17 Apr 2019 05:57:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#7).

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string which is a list of directories, separated
by comma, followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with 150GB max for /data/0 and /data/1 respectively.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify read from the cache
is consistent with what was inserted and to verify that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates the caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with a configuration of an HDFS cluster
using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,157 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/7
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(71 comments)

This new patch changes the configuration string to use a uniform capacity quota for all directories.

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24
PS1, Line 24: is a tuple of (file's name, file's modification time, file offset)
> So this means that you can get a partial cache hit if the offsets match but
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948
PS1, Line 948:     data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes());
> I think for the above counters we figured that this pattern of copying the 
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52
PS1, Line 52:   // The callback is invoked by each thread in the multi-threaded tests below.
> callback reads like it's called when something is done, how about ThreadFn?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157
PS1, Line 157:   // Read the same same range inserted previously and they should still all in the cache.
> nit: be
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196
PS1, Line 196:   // Create a buffer way larger than the cache.
> Why does it need to be larger?
Typo. Fixed.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201
PS1, Line 201:   ASSERT_TRUE(cache.Store("foobar", 12345, 0, large_buffer.get(), cache_size));
> Use variables instead of inline constants?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220
PS1, Line 220:   const int64_t cache_size = 1 << 22;
> I find these easier to read in the form of 4 * 1024 * 1024
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248
PS1, Line 248: differen
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301
PS1, Line 301: ootprint) {
> Can you add a comment what this test does?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330: int main(int argc, char **argv) {
> This is not needed anymore if you add your test to the unified backend test
Prefer to keep this test as stand-alone to make development easier.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18
PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H
> Could use #pragma once instead of include guards
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a backing file reaches a certain
> It's actually a TODO item to retire older files and copy what's left in the
Added a check for filesystem support for hole punching in the new patch.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55
PS1, Line 55: /// created instead. Note that due to hole punching, the backing file is actually sparse.
> This might be a case where a simple ASCII diagram would illustrate the conc
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
> It's by design that we don't support overlapping range for the first implem
Added some explanation in the header comment.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63
PS1, Line 63: /// happens inline currently during Store().
> It's worth documenting the PAGE_SIZE behaviour since it implies that there'
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76
PS1, Line 76: class DataCache{
> style nit: missing space before {
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81
PS1, Line 81:   DataCache(const std::string& config) : config_(config) { }
> totally doesn't matter here, but general best practice in C++11 is to have 
Thanks for the reminder. Almost every time, I have to look up on the internet the advantage of pass-by-value-then-move over pass-by-reference. It's subtle but it makes sense. We should probably start sticking to this idiom more widely in the Impala code base. Maybe a clang-tidy rule would help?!

Also totally irrelevant, but that's an area where I find passing by pointer in C sometimes easier to use or understand than C++.
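
For readers unfamiliar with the idiom, a minimal sketch of the suggested alternative to the constructor quoted above:

#include <string>
#include <utility>

class DataCache {
 public:
  // Pass by value, then move: callers passing an lvalue pay one copy and
  // one move; callers passing an rvalue pay only moves. With a
  // const std::string& parameter, the member initialization always copies.
  explicit DataCache(std::string config) : config_(std::move(config)) {}

 private:
  std::string config_;
};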


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@105
PS1, Line 105:  
> space
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@108
PS1, Line 108:   /// Inserts a new cache entry by copying the content in 'buffer' into the cache.
> Can you specify whether this is synchronous or not? ie should the caller ex
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
> worth documenting under what cases the data wouldn't be stored successfully
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
> Is it interesting to document reasons why the content may not be installed?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
> Maybe point out explicitly that it returns false for errors and if the cont
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@127
PS1, Line 127:   class Partition : public kudu::Cache::EvictionCallback {
> This could be defined in the .cc file and just forward-declared here, right
Tried that but hit some compilation issue about unique_ptr<> needing to know the size of Partition. Moved the definition of CacheFile, CacheKey and CacheEntry into the .cc file.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@129
PS1, Line 129:     /// Creates a partition at the given directory 'path' with quota 'capacity'.
> Mention that capacity is bytes? Maybe even add a suffix that indicates it's
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@157
PS1, Line 157:     void EvictedEntry(kudu::Slice key, kudu::Slice value) override;
> Can you add the virtual and override keywords to all virtual methods to mak
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@162
PS1, Line 162:     struct CacheFile {
> Consider making a destructor here responsible for closing fd if it's been s
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@208
PS1, Line 208:     std::vector<CacheFile> cache_files_;
> It seems like CacheEntrys point to CacheFile objects, but here the CacheFil
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@226
PS1, Line 226:     /// failure. Please note that the backing file is unlinked right after it's created
> How will these files look to an administrator? Will this make it harder to 
Switched to removing stale files during initialization on restart. Punting on sharing the cache directory between Impalads on the same host for now.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@236
PS1, Line 236:     /// 'read' is true iff this function is called from Lookup(). This function is
             :     /// also called from Store(), in which case, 'read' is false.
> nit: I'm not a big fan of parameters that get documented like this. From re
Switched to passing in the operation name in the new patch, if that's easier to follow. We do need to be able to tell the difference though.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:       std::unique_ptr<kudu::faststring>* key);
> You could return the pointer by value
Function deleted in new patch. Added a struct CacheKey instead.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:       std::unique_ptr<kudu::faststring>* key);
> why not just take a 'faststring*' here directly, since it's mutable?
Function deleted in new patch. Added a struct CacheKey instead.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@44
PS1, Line 44: using namespace impala;
            : using namespace impala::io;
> why do you need these? The class is already in impala::io so seems like the
Switched to wrapping the following code with
 namespace impala {
 namespace io {
 }
 }

We don't seem to be too consistent on doing that in all files (e.g. disk-io-mgr.cc).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@57
PS1, Line 57:     "(Advanced) Enable checksumming for the cached buffer.");
> do you have any sense of the overhead here? Checksums with CRC32C are almos
No idea. Please see comments below on why CRC32 isn't picked but it'd be nice to fix the situation in a separate patch.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@62
PS1, Line 62: static const int64_t PAGE_SIZE = 1L << 12;
> any reason for this to be configurable? eg on SSD I think erase blocks are 
This is not configurable, so maybe I misunderstood your comment.

I just round it up to 4KB as that seems to be the minimum block size handled by the page cache; it's not necessarily optimized for the underlying storage media at this moment. This makes hole punching easier, as the holes punched are 4KB aligned.
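
To make that concrete, a minimal sketch of a page-aligned hole punch on Linux (illustrative names; assumes fallocate(2) and _GNU_SOURCE):

#include <fcntl.h>
#include <cstdint>

constexpr int64_t PAGE_SIZE = 1L << 12;  // 4KB.

// Punch a hole to return an evicted entry's disk space to the filesystem.
// Because entry offsets and charged lengths are both 4KB multiples, offset
// and len are already aligned, which is what keeps this call simple.
int PunchHole(int fd, int64_t offset, int64_t len) {
  return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
}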


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@66
PS1, Line 66: DRAM_CACHE
> Should we add DISK_CACHE here? Otherwise, can you add a comment to explain 
This is actually a cache in memory so DRAM_CACHE seems to make sense to me.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@75
PS1, Line 75:   int64_t fd = open(path.c_str(), O_CREAT | O_TRUNC | O_RDWR, S_IRUSR | S_IWUSR);
> We should probably be retrying this on EINTR (Kudu has a RETRY_ON_EINTR mac
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the process exits.
> In terms of preventing the "wiping" of a directory if a previous Impala's a
The new patch set will clean up any remaining backing files from previous runs at partition initialization time. Please note that this assumes each Impalad process will use a unique caching directory on a given host. We already have a similar assumption with the scratch directory today.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@91
PS1, Line 91: Status DataCache::Partition::Init() {
> worth verifying that the path is absolute as well, otherwise a misplaced ch
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@101
PS1, Line 101:   }
> Would be good to borrow 'CheckHolePunch' from kudu/fs/data_dirs.cc in the K
Added something similar in filesystem-util.cc. Importing kudu/fs into the code base can be done later in a separate patch if necessary.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@127
PS1, Line 127:   posix_fadvise(cache_file->fd, entry->offset, bytes_to_read, POSIX_FADV_SEQUENTIAL);
> what's the thinking behind this? This is more likely to cause it to readahe
The thinking behind this is to advise the kernel that we plan to read sequentially in this region of the backing file. Did you mean the following note from the posix_fadvise() man page:

    Under Linux, POSIX_FADV_NORMAL sets the readahead window to the
    default size for the backing device; POSIX_FADV_SEQUENTIAL doubles
    this size, and POSIX_FADV_RANDOM disables file readahead entirely.
    These changes affect the entire file, not just the specified region
    (but other open file handles to the same file are unaffected).

We mostly do sequential reads from the backing files, albeit at different offsets. Is the concern that the call may prefetch unnecessary data beyond the requested regions into the page cache and thus evict useful pages?
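
To make the trade-off concrete, a small illustrative wrapper around the call being discussed:

#include <fcntl.h>

// Hint sequential access before a cache read. Per the man page excerpt
// above, the readahead-window change applies to the whole file, not just
// [offset, offset + len).
void HintSequentialRead(int fd, off_t offset, off_t len) {
  posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
  // The concern raised is prefetch beyond the requested region; the
  // opposite hint, POSIX_FADV_RANDOM, disables file readahead entirely.
}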


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@128
PS1, Line 128:   int64_t bytes_read = pread(cache_file->fd, buffer, bytes_to_read, entry->offset);
> add a TODO to add a metric for time spent reading from cache? We should pro
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@129
PS1, Line 129:   if (UNLIKELY(bytes_read != bytes_to_read)) {
> the pread docs say:
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@139
PS1, Line 139:   meta_cache_->Release(handle);
> use SCOPED_CLEANUP to define this up closer to line 116, and then you can m
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@147
PS1, Line 147:   if (charge_len > capacity_) return false;
> I think the underlying cache ends up sharded once per core unless you expli
Given that we won't actually insert into the cache until after trying to write to the backing file, we may end up exceeding the capacity temporarily. So, this check here may be helpful.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@158
PS1, Line 158:     // TODO: handle cases in which 'store_len' is longer than existing entry;
> It's probably worth documenting this behaviour in the header - it's a littl
Addressed the TODO in the new patch.

Based on the experiments done so far, we rarely if ever have partial hits, which suggests this case is not common in practice. That said, it's not too complicated to handle this case in the code either.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@166
PS1, Line 166:     std::unique_lock<SpinLock> partition_lock(lock_);
> I see you're dropping this lock while writing to disk, but I wonder if inst
Going with a variant of the try-lock solution for now. In particular, the new patch will expose a knob to control the number of concurrent inserts allowed into the cache. Setting it to 1 will simulate the single-concurrent-insert behavior. This allows us to test for any adverse effect on performance with a single thread on fast storage or when we are not IO bound. One concern is that we may just skip inserting in the unfortunate cases when the IO threads are scheduled to run at the same time. Probabilistically, that should be rare over time, but David will probably need to experiment with different concurrencies and see how things go.

For the async write idea, one concern I have is the amount of memory consumed by the queue. We can always put a cap on that and use MemTracker to track and limit the consumption, but that's another bound to tune. Another concern is that the buffer is allocated on the IO thread while its deallocation happens on the async write thread, which is an anti-pattern for TCMalloc. Potentially, we can use the buffer pool to work around it. Left a TODO for it.
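
For illustration, a minimal sketch of such a concurrency knob (hypothetical names): a caller that finds the gate full simply skips the store instead of blocking an IO thread.

#include <atomic>
#include <cstdint>

class InsertGate {
 public:
  explicit InsertGate(int32_t limit) : limit_(limit) {}

  // Returns true if the caller may proceed with the insert.
  bool TryEnter() {
    if (count_.fetch_add(1, std::memory_order_acquire) >= limit_) {
      count_.fetch_sub(1, std::memory_order_release);
      return false;  // Too many concurrent inserts; skip caching this buffer.
    }
    return true;
  }

  void Exit() { count_.fetch_sub(1, std::memory_order_release); }

 private:
  const int32_t limit_;
  std::atomic<int32_t> count_{0};
};

Setting the limit to 1 reproduces the single-concurrent-insert behavior described above.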


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@173
PS1, Line 173:     DCHECK(!cache_files_.empty());
> may as well CHECK here, since this isn't that hot of a path, and it'd be be
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@185
PS1, Line 185:     if (cache_file->current_offset > FLAGS_data_cache_file_max_size) {
> So this seems we can overrun the configured max size by one entry. Should w
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@193
PS1, Line 193:   ssize_t bytes_written = 0;
> declare right before usage
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@208
PS1, Line 208: pwrite
> same as above, this needs to loop and handle partial writes and EINTR. 
Done
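
For reference, a sketch of the loop being asked for, retrying interrupted writes and advancing past short ones (illustrative, not the patch's code):

#include <cerrno>
#include <cstdint>
#include <unistd.h>

// Write the whole buffer at the given offset, retrying on EINTR and on
// partial writes. Returns true on success, false on a hard error.
bool FullPWrite(int fd, const uint8_t* buf, size_t len, off_t offset) {
  size_t written = 0;
  while (written < len) {
    ssize_t n = pwrite(fd, buf + written, len - written, offset + written);
    if (n < 0) {
      if (errno == EINTR) continue;  // Interrupted; retry the same write.
      return false;                  // Hard error, e.g. ENOSPC.
    }
    written += static_cast<size_t>(n);  // Partial write; keep going.
  }
  return true;
}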


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222
PS1, Line 222: done:
> agreed
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222
PS1, Line 222: done:
> You could use a scoped cleanup lambda and returns instead of goto
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@233
PS1, Line 233:       BitUtil::RoundUp(entry->offset + entry->len, PAGE_SIZE) - entry->offset;
> isn't this equivalent to RoundUp(entry->len, PAGE_SIZE) because we know tha
Yes. Added a DCHECK to confirm entry->offset is page-aligned.
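
Spelled out, the simplification being agreed on (a self-contained sketch; RoundUp here stands in for BitUtil::RoundUp):

#include <cstdint>

constexpr int64_t PAGE_SIZE = 1L << 12;

inline int64_t RoundUp(int64_t v, int64_t factor) {
  return (v + factor - 1) / factor * factor;
}

// With offset % PAGE_SIZE == 0, these two expressions are equal:
//   RoundUp(offset + len, PAGE_SIZE) - offset  ==  RoundUp(len, PAGE_SIZE)
// because adding a multiple of PAGE_SIZE before rounding shifts the
// result by exactly that multiple.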


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@234
PS1, Line 234: fallocate
> RETRY_ON_EINTR
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@245
PS1, Line 245:   return HashUtil::FastHash64(buffer, buffer_len, 0xcafebeef);
> Why not use CRC32 which can be optimized to around 8 bytes/cycle? (our impl
Yes, the current implementation in hash-util.h doesn't support lengths > 32 bits and it's also not using SSE4_crc32_u64(). Changing that would require changing some codegen'd functions, according to the comments, so I'm choosing the easier path for now. Added a TODO here.
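
For context, hardware-accelerated CRC32C along the lines suggested looks roughly like this (a sketch using SSE4.2 intrinsics, not the patch's code; requires -msse4.2):

#include <cstdint>
#include <cstring>
#include <nmmintrin.h>  // SSE4.2 _mm_crc32_* intrinsics.

// CRC32C over a buffer, consuming 8 bytes per instruction in the main loop.
uint32_t Crc32c(const uint8_t* data, size_t len, uint32_t crc = 0) {
  size_t i = 0;
  for (; i + 8 <= len; i += 8) {
    uint64_t chunk;
    memcpy(&chunk, data + i, sizeof(chunk));  // Safe unaligned load.
    crc = static_cast<uint32_t>(_mm_crc32_u64(crc, chunk));
  }
  for (; i < len; ++i) crc = _mm_crc32_u8(crc, data[i]);
  return crc;
}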


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@262
PS1, Line 262:   for (const string& cache_config : all_cache_configs) {
> gutil has SplitStringIntoKeyValuePairs() in strings.h that could be helpful
Changed the config string format in the new patch. Thanks for the suggestion. There are some useful utility functions in gutil/strings/split.h.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@323
PS1, Line 323:   int idx = HashUtil::FastHash64(key->data(), key->size(), 0) % partitions_.size();
> maybe combine the hashing and the key construction into one method like Con
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h
File be/src/runtime/io/disk-io-mgr.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h@198
PS1, Line 198: HDFS file blocks
> not necessarily HDFS.
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h@445
PS1, Line 445:   std::unique_ptr<DataCache> remote_data_cache_;
> worth clarifying in the doc whether this would be null if not configured
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@58
PS1, Line 58: // tuples, separated by comma. An example could be: /data/0:1TB,/data/1:1TB.
> This info should go into the gflags help string instead of a comment
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
> redundant get() != nullptr -- unique_ptr implicitly casts to bool in the wa
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
> Agreed probably fatal here
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
> Should log the error message at least.
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
> Have you considered making this fatal?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h
File be/src/runtime/io/hdfs-file-reader.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h@48
PS1, Line 48:   virtual void CachedFile(uint8_t** data, int64_t* length) override;
> Maybe rename to ReadHdfsCachedFile to make its purpose more clear at the ca
This is a common path shared with local-file-reader.h too. Not sure if it makes sense to call it ReadHdfsCachedFile() in that context?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@110
PS1, Line 110:     DataCache* remote_data_cache = io_mgr->remote_data_cache();
> I feel like this should be factored out into its own function so that the f
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@124
PS1, Line 124:         if (exclusive_hdfs_fh_ != nullptr) {
> I think a comment here would help explain why we do this
Please see the changes in ReadFromPosInternal(); this is now removed in the new patch set.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@163
PS1, Line 163:         LOG(INFO) << "Reopening file " << scan_range_->file_string()
> VLOG(3) or something?
Done. Left it as VLOG(2) as it was helpful for debugging at some point.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@186
PS1, Line 186:    if (borrowed_hdfs_fh != nullptr) {
             :       io_mgr->ReleaseCachedHdfsFileHandle(scan_range_->file_string(), borrowed_hdfs_fh);
             :     }
> is it possible this can get skipped in some of the early return error condi
Done. It's impossible with the current code, but switching to scoped cleanup makes the code more robust against any future changes.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193
PS1, Line 193:       remote_data_cache->Store(*scan_range_->file_string(), scan_range_->mtime(),
> I had the same question. I think it sort-of works if you have a cache entry
Addressed the concern with larger entries in the new patch set. To answer the question, partial hits work in the current patch, and we only issue reads for the part missing from the cache. The insertion will be for the entire buffer though, not just the missing part.

That said, based on the crude study with parquet + TPCDS on this current patch, there are barely any partial hits in the cache (see the partial hit counter above), which suggests that entries at the same offset but with different lengths aren't an issue for the workload we tested with.

Just curious: in what cases will the scanners today issue a scan range with the same offset in a file but a different length?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h
File be/src/runtime/io/request-context.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h@161
PS1, Line 161:   int data_cache_partial_hit_count() const { return data_cache_partial_hit_count_.Load(); }
> line too long (91 > 90)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:41:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(3 comments)

> Another option would be to do a global hashtable and have the
 > entries directly identify their partitions (in fact I think they
 > already do via the CacheFile pointer, right?). Eviction still needs
 > to be per-partition based on that partition's capacity, but
 > insertion can try to more smartly allocate across drives.
 > 

Yes, I had a similar question when thinking over the patch over the weekend. I tried the approach of weighted sampling + a global LRU cache, but stopped short of coming up with a way to maintain the per-partition quota with a global LRU list, so I abandoned it and punted on this problem for now. Maybe it's okay to do LRU per partition; I should have documented my thinking there in the code.

I think having the _IO_ apportioned relative to capacity may be fine, as that's a reasonable expectation if a user specifies more capacity on a slower device. One thing to consider may be to use multiple backing files for a larger partition to avoid the per-file lock problem.

 > Last thought: since there are some tricky design decisions above,
 > maybe we can simplify this and just support a single partition for
 > now, and come back to supporting multi-partition?

It may be reasonable to punt on it for now. Let me see how much work it is to switch over to the global hash-table + per-partition LRU approach.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a backing file reaches a certain
> This does mean that the number of backing files could grow unbounded right
It's actually a TODO item to retire older files and copy what's left in them to the current file. I didn't get around to doing that in this first version.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
> We should mention in the comment how lookup works, in particular what happe
It's by design that we don't support overlapping ranges in the first implementation. It seems to fare pretty well with the TPC-DS workload and the Parquet file format we are using.

One future TODO is to measure the overlapping case and consider handling it if it turns out to be common.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the process exits.
> I'm not sure how I feel about this - it could cause a lot of confusion havi
Thanks for the feedback. I also considered cleaning up the directory on startup, but that makes certain assumptions about whether the directory the user specified for caching is used exclusively by the Impalad process.

While this is not a common scenario right now, I can see future configurations where one runs multiple Impala containers on the same host and those containers belong to the same cluster, so wiping out a particular directory on restart isn't exactly safe. An alternative would be to include the hostname / IP address + port number in the file name, so the file is automatically truncated on restart. However, there is no guarantee that an Impalad will restart with the same IP address / hostname after a crash, especially in a public cloud setting.

That said, one can also argue that this is a configuration issue and those Impala containers shouldn't be sharing directories. The feedback so far suggests that the "hidden" disk usage is even more undesirable, so maybe wiping out the directory at startup is not as bad, and we can just forbid multiple Impalads on the same host from sharing caching directories.
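
For reference, the pattern under discussion looks roughly like the following standalone sketch (not the actual DataCache code). The file is unlinked right after creation, so the kernel reclaims its blocks once the last open descriptor goes away, e.g. at process exit:

  #include <fcntl.h>
  #include <sys/stat.h>
  #include <unistd.h>

  // Create a backing file and unlink it immediately. The inode stays alive
  // while 'fd' is open, and the space is reclaimed automatically when the
  // process exits (cleanly or not). The trade-off is that the usage is
  // invisible to ls/du, which is the "hidden" disk usage concern above.
  int OpenUnlinkedBackingFile(const char* path) {
    int fd = open(path, O_CREAT | O_TRUNC | O_RDWR, S_IRUSR | S_IWUSR);
    if (fd < 0) return -1;
    if (unlink(path) < 0) {
      close(fd);
      return -1;
    }
    return fd;
  }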



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 11 Apr 2019 17:13:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#6).

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of directories, separated
by commas, followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB max for each of /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file's range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for data in the cache, the interface Lookup() is used;
to insert data into the cache, the interface Store() is used. Please note
that eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache. Each node has a cache size of
150GB. The performance is at parity with an HDFS cluster using EBS as
the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,062 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/6
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 11: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 11
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 02:49:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4148/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 03 May 2019 14:10:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by consolidating the content of very
> Yeah this scenario is a bit concerning for me still since it's conceivable 
I think with our current assumption of 4TB of logical space per file, if you assume you're writing a relatively aggressive estimate of 100MB/sec to the cache in a super heavy workload, then you'll only need to roll to a new file every 11 hours. So, 1000 file descriptors will last you over a year. Given that fd counts can be set into the 100k range without any real issues, I don't think this is going to be too problematic under these assumptions.

If we find that we need to support a smaller "rolling" interval than 4TB for some reason, we'll definitely need to address it, but it seems like a lot of complexity to take for now.
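
For concreteness, the arithmetic is roughly:

  4 TB / (100 MB/s) = 4e12 B / 1e8 B/s = 4e4 s ~= 11 hours per file
  1000 files * 11 hours/file ~= 11,000 hours ~= 460 days

so a budget of 1000 fds does indeed last over a year at that write rate.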



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 4
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 25 Apr 2019 18:05:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(37 comments)

Didn't look through the tests yet, but here are some comments on the code. Overall looking pretty good! Like the thorough docs.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76
PS1, Line 76: class DataCache{
style nit: missing space before {


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81
PS1, Line 81:   DataCache(const std::string& config) : config_(config) { }
totally doesn't matter here, but the general best practice in C++11 is to have such parameters passed by value and then assigned in the initializer list using std::move:

DataCache(std::string config) : config_(std::move(config)) {}

since it avoids a copy in the case that you pass a temporary or other rvalue reference into the parameter.

This is more important with big heavy structures and things called in hot paths of course... I don't really care too much here, but figured I'd note it. In the Kudu codebase our clang-tidy bot catches and complains about these.


On a separate note: this should be marked explicit.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@108
PS1, Line 108:   /// Inserts a new cache entry by copying the content in 'buffer' into the cache.
Can you specify whether this is synchronous or not? ie should the caller expect this might take a long time (and thus not want to be holding locks, etc)?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
worth documenting in which cases the data wouldn't be stored successfully


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@127
PS1, Line 127:   class Partition : public kudu::Cache::EvictionCallback {
This could be defined in the .cc file and just forward-declared here, right?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@162
PS1, Line 162:     struct CacheFile {
Consider making a destructor here responsible for closing fd if it's been set (and make it non-copyable so that's safe)
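
i.e. something along these lines (a minimal RAII sketch; the real CacheFile carries more state):

  #include <unistd.h>

  // The destructor closes the fd if it's been set, and copying is deleted so
  // two objects can never own (and double-close) the same descriptor.
  struct CacheFile {
    explicit CacheFile(int fd) : fd_(fd) {}
    ~CacheFile() { if (fd_ >= 0) close(fd_); }
    CacheFile(const CacheFile&) = delete;
    CacheFile& operator=(const CacheFile&) = delete;
   private:
    int fd_;
  };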


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@208
PS1, Line 208:     std::vector<CacheFile> cache_files_;
It seems like CacheEntrys point to CacheFile objects, but here the CacheFiles can be reallocated when the vector resizes. Would need to make this a vector<unique_ptr<>> to fix.

I'm guessing the test isn't covering creation of enough CacheFiles to trigger the vector reallocation. Might be worth such a stress test.
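
i.e. roughly (sketch only):

  #include <memory>
  #include <vector>

  struct CacheFile { int fd = -1; };  // stand-in for the real class

  // Holding the files by unique_ptr keeps each CacheFile at a stable heap
  // address, so raw CacheFile* pointers held by cache entries remain valid
  // when the vector reallocates as it grows.
  std::vector<std::unique_ptr<CacheFile>> cache_files_;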


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@236
PS1, Line 236:     /// 'read' is true iff this function is called from Lookup(). This function is
             :     /// also called from Store(), in which case, 'read' is false.
nit: I'm not a big fan of parameters that get documented like this. From reading this I have no idea what it actually does, or why I'd set it to true or false.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239
PS1, Line 239:     /// Please note that this function will CHECK fail if there is any checksum mismatch.
Can we make checksum errors get treated as cache misses and remove the entry after logging? It seems like we shouldn't be fatal if the disk has some flipped bit. (maybe DCHECK for debug builds to catch bugs)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:       std::unique_ptr<kudu::faststring>* key);
why not just take a 'faststring*' here directly, since it's mutable?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@44
PS1, Line 44: using namespace impala;
            : using namespace impala::io;
why do you need these? The class is already in impala::io, so it seems like the using directives aren't really necessary (and in general I think the style guide discourages them)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@57
PS1, Line 57:     "(Advanced) Enable checksumming for the cached buffer.");
do you have any sense of the overhead here? Checksums with CRC32C run at almost 8 bytes/cycle (so <500us to checksum an 8MB read at 2.5GHz). I think we should probably default them to on for safety's sake
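
For reference, the back-of-the-envelope math behind that estimate:

  8 MB / (8 B/cycle) ~= 1e6 cycles; 1e6 cycles / 2.5e9 Hz ~= 420 us per 8MB read

which is tiny relative to the cost of a remote read.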


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@62
PS1, Line 62: static const int64_t PAGE_SIZE = 1L << 12;
any reason for this to be configurable? e.g. on SSDs I think erase blocks are typically 64KB, so maybe there's a benefit to writing 64KB-aligned? Not sure.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@75
PS1, Line 75:   int64_t fd = open(path.c_str(), O_CREAT | O_TRUNC | O_RDWR, S_IRUSR | S_IWUSR);
We should probably be retrying this on EINTR (Kudu has a RETRY_ON_EINTR macro that can be used).

Same goes for most of the other syscalls here (unlink(), pread(), etc, I think can all get interrupted)
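
For illustration, the expanded form of that idiom looks roughly like this (a sketch, not Kudu's actual macro definition):

  #include <cerrno>
  #include <fcntl.h>
  #include <sys/stat.h>

  // Retry the syscall for as long as it fails with EINTR (interrupted by a
  // signal before any work was done).
  int OpenRetryEintr(const char* path, int flags, mode_t mode) {
    int fd;
    do {
      fd = open(path, flags, mode);
    } while (fd < 0 && errno == EINTR);
    return fd;
  }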


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@91
PS1, Line 91: Status DataCache::Partition::Init() {
worth verifying that the path is absolute as well, otherwise a misplaced chdir somewhere could really screw us up


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@101
PS1, Line 101:   }
Would be good to borrow 'CheckHolePunch' from kudu/fs/data_dirs.cc in the Kudu repo -- this does a "test hole punch" on a file to make sure that the filesystem has the necessary support before proceeding.
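
A much-simplified sketch of such a probe (Kudu's CheckHolePunch does more verification; this only checks that the call succeeds):

  #include <fcntl.h>  // Linux-specific; fallocate() needs _GNU_SOURCE

  // Punch a hole in a scratch region of the file. Filesystems without
  // hole-punch support fail with EOPNOTSUPP (ext4 and xfs support it), so
  // we can fail fast at startup instead of during eviction.
  bool SupportsHolePunch(int fd) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     /*offset=*/0, /*len=*/4096) == 0;
  }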


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@127
PS1, Line 127:   posix_fadvise(cache_file->fd, entry->offset, bytes_to_read, POSIX_FADV_SEQUENTIAL);
what's the thinking behind this? This is more likely to cause it to readahead _past_ the requested range, which isn't what we want


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@128
PS1, Line 128:   int64_t bytes_read = pread(cache_file->fd, buffer, bytes_to_read, entry->offset);
add a TODO to add a metric for time spent reading from cache? We should probably have a query profile entry for this too, since the cache may have spilled to disk and we don't know how fast our local disk might be


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@129
PS1, Line 129:   if (UNLIKELY(bytes_read != bytes_to_read)) {
the pread docs say:

       Note that it is not an error for a successful call to transfer fewer bytes
       than requested (see read(2) and write(2)).

Usually this won't happen from a filesystem, but I think it can happen if we get interrupted by a signal, according to the manpage. May need to loop.
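
e.g. a standalone sketch of such a loop:

  #include <cerrno>
  #include <unistd.h>

  // Read exactly 'len' bytes at 'offset', looping over short reads and EINTR.
  // Returns the number of bytes read (< len only at EOF), or -1 on error.
  ssize_t PreadFully(int fd, void* buf, size_t len, off_t offset) {
    size_t total = 0;
    while (total < len) {
      ssize_t n = pread(fd, static_cast<char*>(buf) + total, len - total,
                        offset + static_cast<off_t>(total));
      if (n < 0) {
        if (errno == EINTR) continue;  // interrupted by a signal: retry
        return -1;
      }
      if (n == 0) break;  // EOF
      total += static_cast<size_t>(n);
    }
    return static_cast<ssize_t>(total);
  }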


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@139
PS1, Line 139:   meta_cache_->Release(handle);
use SCOPED_CLEANUP to define this up closer to line 116, and then you can make the 'bytes_read = 0' on line 134 just be an early 'return 0' which will likely be easier to follow.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@147
PS1, Line 147:   if (charge_len > capacity_) return false;
I think the underlying cache ends up sharded once per core unless you explicitly override it somehow. That means the max item size is actually capacity/num_cores or so. Seems like this should already be handled internally by the cache.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@166
PS1, Line 166:     std::unique_lock<SpinLock> partition_lock(lock_);
I see you're dropping this lock while writing to disk, but I wonder if instead we should be limiting to only a single concurrent disk writer at a time, and use a 'trylock', so that if we have a lot of parallel threads trying to populate the cache, we don't actually slow down "cold cache" queries. As is, I think this is likely to cause a regression in performance when the cache is too small to get a good hit rate, since the cache churning will cause all of our reads to become writes to a single local disk.

A fancier solution would be to defer the writeback to a separate threadpool with a limited queue length (limited by size, as an effective buffer limit). That way no reader thread is ever blocked on cache writeback. But maybe a simple trylock is easier to implement for now?
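
A rough sketch of the simple try-lock variant (hypothetical names; not the actual Store() signature):

  #include <cstdint>
  #include <mutex>

  std::mutex write_lock_;  // allows at most one in-flight cache writeback

  // If another thread is already writing back, skip this insert instead of
  // queueing behind its disk IO; a cache store is best-effort anyway.
  bool TryStore(const uint8_t* buffer, int64_t len) {
    std::unique_lock<std::mutex> lock(write_lock_, std::try_to_lock);
    if (!lock.owns_lock()) return false;
    // ... append [buffer, buffer + len) to the current backing file ...
    (void)buffer;
    (void)len;
    return true;
  }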


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@173
PS1, Line 173:     DCHECK(!cache_files_.empty());
may as well CHECK here, since this isn't that hot of a path, and it'd be better to get a nice crash here than a less nice SEGV two lines below.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@185
PS1, Line 185:     if (cache_file->current_offset > FLAGS_data_cache_file_max_size) {
So this seems we can overrun the configured max size by one entry. Should we be more conservative and do this up on line 175 so that we switch to a new file if appending our current entry would push us over the max? (and I guess check that current offset is not 0 so we don't get in a loop)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@208
PS1, Line 208: pwrite
same as above, this needs to loop and handle partial writes and EINTR. 

You could use RWFile from kudu/util/env.h to handle all this stuff for you


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222
PS1, Line 222: done:
> You could use a scoped cleanup lambda and returns instead of goto
agreed


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@233
PS1, Line 233:       BitUtil::RoundUp(entry->offset + entry->len, PAGE_SIZE) - entry->offset;
isn't this equivalent to RoundUp(entry->len, PAGE_SIZE) because we know that entry->offset is already rounded to the page size?
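
For a concrete check: with PAGE_SIZE = 4096, offset = 8192 and len = 100, RoundUp(8192 + 100, 4096) - 8192 = 12288 - 8192 = 4096, and RoundUp(100, 4096) = 4096 as well. The equivalence holds whenever the offset is a multiple of the page size.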


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@234
PS1, Line 234: fallocate
RETRY_ON_EINTR


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@245
PS1, Line 245:   return HashUtil::FastHash64(buffer, buffer_len, 0xcafebeef);
Why not use CRC32 which can be optimized to around 8 bytes/cycle? (our implementation right now in hash-util.h is not that good but there are good open source implementations that are)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@323
PS1, Line 323:   int idx = HashUtil::FastHash64(key->data(), key->size(), 0) % partitions_.size();
maybe combine the hashing and the key construction into one method like ConstructKey(&key, &hash)?

Also was thinking maybe this would be a bit cleaner to introduce a CacheKey struct with members like 'Encode' and 'Hash'
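
i.e. something along these lines (a rough sketch; the encoding is illustrative, not the patch's actual key format):

  #include <cstdint>
  #include <functional>
  #include <string>

  struct CacheKey {
    std::string filename;
    int64_t mtime;
    int64_t offset;

    // Flatten (filename, mtime, offset) into a byte string usable as the
    // cache key.
    std::string Encode() const {
      std::string key;
      key.append(reinterpret_cast<const char*>(&mtime), sizeof(mtime));
      key.append(reinterpret_cast<const char*>(&offset), sizeof(offset));
      key.append(filename);
      return key;
    }

    // Hash of the encoded key, e.g. for choosing a partition.
    uint64_t Hash() const { return std::hash<std::string>()(Encode()); }
  };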


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h
File be/src/runtime/io/disk-io-mgr.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h@198
PS1, Line 198: HDFS file blocks
not necessarily HDFS.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.h@445
PS1, Line 445:   std::unique_ptr<DataCache> remote_data_cache_;
worth clarifying in the doc whether this would be null if not configured


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@58
PS1, Line 58: // tuples, separated by comma. An example could be: /data/0:1TB,/data/1:1TB.
This info should go into the gflags help string instead of a comment


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
redundant get() != nullptr -- unique_ptr implicitly casts to bool in the way you'd expect


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
> Should log the error message at least.
Agreed probably fatal here


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@163
PS1, Line 163:         LOG(INFO) << "Reopening file " << scan_range_->file_string()
VLOG(3) or something?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@186
PS1, Line 186:    if (borrowed_hdfs_fh != nullptr) {
             :       io_mgr->ReleaseCachedHdfsFileHandle(scan_range_->file_string(), borrowed_hdfs_fh);
             :     }
is it possible this can get skipped in some of the early return error conditions above? Maybe would be better to use a SCOPED_CLEANUP here



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 10 Apr 2019 21:24:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of directories, separated
by commas, followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB max for each of /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file's range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for data in the cache, the interface Lookup() is used;
to insert data into the cache, the interface Store() is used. Please note
that eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache. Each node has a cache size of
150GB. The performance is at parity with an HDFS cluster using EBS as
the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/9
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2722/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 10 Apr 2019 17:46:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 7: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   };
> The reason for using a pool is that we want to be able to queue the work fo
That makes sense, thx.


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h
File be/src/util/impalad-metrics.h:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91
PS7, Line 91: skipped
In the rest of the code "skipped" often means "pruned" and is a good thing, e.g. we figured out we didn't have to read something. Here its meaning is closer to "rejected", so would IO_MGR_REMOTE_DATA_CACHE_REJECTED_BYTES be a better choice? I don't feel strongly about it.


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23
PS5, Line 23: 
            :   """ This test enables the data cache an
> It checks both. It checks the metrics in the python code but also checks th
Thx for clarifying.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 01 May 2019 21:04:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "David Rorke (Code Review)" <ge...@cloudera.org>.
David Rorke has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

> (3 comments)
 > 
 > > Another option would be to do a global hashtable and have the
 > > entries directly identify their partitions (in fact I think they
 > > already do via the CacheFile pointer, right?). Eviction still
 > needs
 > > to be per-partition based on that partition's capacity, but
 > > insertion can try to more smartly allocate across drives.
 > >
 > 
 > Yes, I had a similar question when thinking over the patch over the
 > weekend. I tried the approach of weighted sampling + a global LRU
 > cache but stopped short of coming up with a way to maintain the
 > quota per partition with a global LRU list so I abandoned it and
 > kind of punted on this problem for now. May be it's okay to do LRU
 > per partition. Should have documented my thought there in the code.
 > 
 > I think the _IO_ apportioned relative to capacity may be fine as
 > that's a reasonable expectation if a user specifies more capacity
 > on a slower device. One thing to consider may be to use multiple
 > backing files for larger partition to avoid per-file lock problem.
 > 
 > > Last thought: since there are some tricky design decisions above,
 > > maybe we can simplify this and just support a single partition
 > for
 > > now, and come back to supporting multi-partition?
 > 
 > It may be reasonable to punt on it for now. Let me see how much
 > work it is to switch over to the global hash-table + per-partition
 > LRU approach.

My main concern with a single partition in the short term is whether we can get enough IOPS from a single spinning disk per node.  For cloud deployments we should definitely try to use SSD for the cache, but if we expect this to be used for on-prem compute clusters it's likely that many of those will have no SSD but multiple spinning disks.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 11 Apr 2019 19:14:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 8:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/3029/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 02 May 2019 01:26:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
> redundant get() != nullptr -- unique_ptr implicitly casts to bool in the wa
I'm on your side on this one, but some people actually disagreed with me, and I think we effectively called a truce in the great "whether to use .get() in null checks" debate. Similar to American vs Commonwealth English. It would be nice to have a consensus going forward - maybe people's feelings have changed.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 10 Apr 2019 21:51:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88
PS2, Line 88: cosolidating
consolidating


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89
PS2, Line 89: sprase
sparse


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184
PS2, Line 184: emtpy
empty


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193
PS1, Line 193:       return Status(TErrorCode::DISK_IO_ERROR, GetBackendString(),
> Addressed the concern with larger entry in the new patch set. To answer the
Two cases I can think of:
* If a scan was not able to get its "ideal" reservation and is scanning with a smaller buffer size. This is hopefully rare.
* With the new Parquet page index support, if a different subset of pages is selected.

I'm ok with deferring this but I think it is worth understanding if there's a bad interaction between this and page indexes.



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 17 Apr 2019 06:58:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@22
PS5, Line 22: class TestDataCache(CustomClusterTestSuite):
flake8: E302 expected 2 blank lines, found 1



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 27 Apr 2019 07:23:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
> I'm on your side on this one but some people actually disagreed with me and
yea, I guess I go this way because clang-tidy has this check which we use in Kudu, so I'm used to it getting flagged: https://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-smartptr-get.html



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 10 Apr 2019 23:50:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117
PS2, Line 117: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120
PS2, Line 120: ;
flake8: E703 statement ends with a semicolon



-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:41:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 7:

(41 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273
PS5, Line 273:   auto s =
> I just found out we have ScopedFlagSetter in scoped-flag-setter.h, I think 
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112
PS6, Line 112: 
             :   /// 'config' is the configurati
> per the commit message, we've moved to a single quota rather than per-direc
Yes, comments were stale. Fixed in new patch.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261
PS6, Line 261: 
> perhaps init to -1?
Done. Switched to initializing it to 0 in CreateCacheFile() instead.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215
PS5, Line 215: 
> 'start_reclaim'?
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   };
> Can you mention in the comment that the pool has only 1 thread and why you'
The reason for using a pool is that we want to be able to queue the work for deferred processing by another thread, and ThreadPool seems to provide the right abstraction. Maybe there are other classes in utility/ which are also applicable?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341
PS5, Line 341: 
> Some functions around deleting files are called "Close...". We should point
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@306
PS4, Line 306:   cache_files_.emplace_back(std::mo
> This check is racy. More allocation could have happened since 'insertion_of
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95
PS5, Line 95: ng files opened, ol
> switch to single thread, or mention pool here
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112
PS5, Line 112: DeleteFile
> Can we call this DeleteFile? Otherwise there's a third thing to keep track 
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125
PS5, Line 125: d the lock in
> It's not obvious to me why we only need a percpu_rwlock here. Can you add a
Done. Please also see explanation in comments above.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208
PS5, Line 208: 
> nit: singular
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335
PS5, Line 335: ing files left over fro
> Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I thin
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395
PS5, Line 395: }
> Will this handle hole punching through the eviction logic?
Yes.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436
PS5, Line 436:   // Try verifying the checksum of the new buffer matches that of the existing entry.
> nit: only append the "checksum $3" part if checksumming is enabled? I don't
Keeping it for simplicity.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457
PS5, Line 457: == nullptr)) r
> start_reclaim?
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633
PS5, Line 633: URN_IF_ERROR(p
> start_reclaim?
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64
PS6, Line 64: DEFINE_int64(data_cache_file_max_size_bytes, 1L << 40 /* 1TB */,
> - can you add a comment here like /* 4TB */?
Done.

Yes, according to https://access.redhat.com/solutions/1532, ext4 should support files up to 16TB. Apparently, there is still a need to support ext3, which has a limit of 2TB. So, I will set it to 1TB to be safe.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68
PS6, Line 68:     "(Advanced) The maximum number of allowed opened files. This must be at least the "
> Setting this per-partition creates a dependency between this and the number
Makes sense. Done.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70
PS6, Line 70: DEFINE_int32(data_cache_write_concurrency, 1,
> is this per-partition? should be, right?
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@117
PS6, Line 117:     KUDU_RETURN_IF_ERROR(kudu::Env::Default()->NewRWFile(path, &cache_file->file_),
> why not pass the RWFile into the CacheFile constructor vs creating an empty
One advantage is hiding the implementation of CacheFile from the caller (e.g. if we can somehow switch to using Impala's own implementation of RWFile).

I forgot to make the constructor private. I should have done that so there is no way to create an empty CacheFile.
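
The shape being described is roughly the following sketch (names hypothetical; the real code wraps kudu::Env::Default()->NewRWFile() as quoted above, and the stand-in OpenBackingFile() below is not the actual function):

    #include <memory>
    #include <string>
    #include <utility>

    class CacheFile {
     public:
      // The only way to obtain a CacheFile; returns nullptr on failure, so a
      // half-initialized object (one with no open backing file) never escapes.
      static std::unique_ptr<CacheFile> Create(std::string path) {
        std::unique_ptr<CacheFile> file(new CacheFile(std::move(path)));
        if (!file->OpenBackingFile()) return nullptr;
        return file;
      }
     private:
      explicit CacheFile(std::string path) : path_(std::move(path)) {}
      // Stand-in for the real creation path, which calls kudu::Env::NewRWFile().
      bool OpenBackingFile() { return true; }
      std::string path_;
    };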


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@141
PS6, Line 141:     Close();
> WARN_NOT_OK could be used here
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@153
PS6, Line 153: int64_
> 'inline' here and elsewhere isn't necessary since you've defined them inlin
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@174
PS6, Line 174:     DCHECK_EQ(offset % PAGE_SIZE, 0);
> worth a DCHECK that offset + bytes_to_read <= current_offset_
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@192
PS6, Line 192:     DCHECK_EQ(offset % PAGE_SIZE, 0);
> same DCHECK suggested above
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@205
PS6, Line 205:   }
> same
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@228
PS6, Line 228: 
> instead of just saying used for synchronozation" I think best to say "taken
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@281
PS6, Line 281:     return key_;
> can you DCHECK that lock_ is held by the current thread here? (same elsewhe
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@347
PS6, Line 347:   }
             : 
> is this the right method call? seems the same as above
Oops, I meant to call GetFileSize().


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@352
PS6, Line 352: acking file for the partition.
> .clear()
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@360
PS6, Line 360:   for (auto& file : cache_files_) {
> dcheck the lock is held?
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@380
PS6, Line 380: 
> I think this pattern is used pretty widely in Impala but it's moderately sk
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@395
PS6, Line 395: }
> should LOG(WARNING) an error here on bad checksum, so people know their dri
There is already a LOG(DFATAL) in VerifyChecksum() in case there is any checksum mismatch.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@436
PS6, Line 436:   // Try verifying the checksum of the new buffer matches that of the existing entry.
> maybe VLOG(3)?
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@444
PS6, Line 444:   }
> similar to comment above, I think it's better to create a struct, set the f
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@508
PS6, Line 508:  path 
> nit: emplace
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@588
PS6, Line 588: 
> CHECK here (not a hot path and we'd crash later)
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@616
PS6, Line 616:     return Status(Substitute("Configured data cache capacity $0 is too small",
> 3 or 4 maybe? this will be quite noisy
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc@37
PS5, Line 37: "
> nit: trailing space
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/util/filesystem-util.cc
File be/src/util/filesystem-util.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/util/filesystem-util.cc@305
PS6, Line 305: }
> we should probably test for known buggy el6 versions and ext4 here and disa
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@22
PS5, Line 22: 
> flake8: E302 expected 2 blank lines, found 1
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23
PS5, Line 23: 
            :   """ This test enables the data cache an
> It actually seems to check the metrics, not the profile counters.
It checks both: the metrics in the Python code, and the profile counters in "QueryTest/data-cache.test".




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#3).

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of directories separated
by commas, followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with at most 150GB each for /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.

Optionally, checksumming can be enabled to verify that reads from the
cache are consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done: a new BE test was added; core test with cache enabled.

Perf:
- 16-stream TPCDS at 3TB in a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with an HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
18 files changed, 1,706 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/3

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(13 comments)

I did a quick pass just trying to understand the big picture. Looks like Lars beat me to some things.

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24
PS1, Line 24: is a tuple of (file's name, file's modification time, file offset)
So this means that you can get a partial cache hit if the offsets match but lengths differ, but not if the offsets differ.

I guess this is a consequence of the cache being keyed this way (one alternative would be to key by name/mod and then have some kind of sorted map of cache entries per file). It seems like this implementation simplifies things, and the benefit of supporting more partial cache hits would probably be minimal.

Might be worth briefly mentioning this fact and any reasoning behind it, for posterity.
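
To make the exact-match behaviour concrete, here is a small self-contained C++ sketch (the encoding is illustrative only; the patch builds its key differently, from a kudu::faststring): the offset is baked into the key bytes, so a lookup at a different offset can never match an existing entry.

    #include <cstdint>
    #include <string>

    // Hypothetical key encoding: (file name, mtime, offset) concatenated.
    std::string MakeKey(const std::string& fname, int64_t mtime, int64_t offset) {
      std::string key = fname;
      key.append(reinterpret_cast<const char*>(&mtime), sizeof(mtime));
      key.append(reinterpret_cast<const char*>(&offset), sizeof(offset));
      return key;
    }

    // MakeKey("f", t, m) != MakeKey("f", t, m + 4096): a reader asking for
    // [m+4K, m+8K) computes a different key than the entry stored for
    // [m, m+4MB), so it misses even though the cached bytes cover the range.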


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948
PS1, Line 948:     data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes());
I think for the above counters we figured that this pattern of copying the value out of the counter at the end was bad. Instead we could just pass in the counter to reader_context_ and have it updated live. E.g. see reader_context_->set_bytes_read_counter()
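
A minimal sketch of the suggested pattern, in generic C++ rather than Impala's actual RuntimeProfile/RequestContext API (the class and method names here are hypothetical): the scan node registers its counter once, and the IO path bumps it in place, so the profile stays live instead of being Set() once at the end.

    #include <atomic>
    #include <cstdint>

    class ReaderContext {
     public:
      // Called once at setup; the profile owns the counter.
      void set_data_cache_miss_bytes_counter(std::atomic<int64_t>* counter) {
        miss_bytes_counter_ = counter;
      }
      // Called from the IO path; the profile sees updates immediately.
      void AddCacheMissBytes(int64_t bytes) {
        if (miss_bytes_counter_ != nullptr) miss_bytes_counter_->fetch_add(bytes);
      }
     private:
      std::atomic<int64_t>* miss_bytes_counter_ = nullptr;
    };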


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18
PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H
Could use #pragma once instead of include guards


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a backing file reaches a certain
This does mean that the number of backing files could grow unbounded, right? E.g. if you end up with one or more long-lived cache entries per file. This doesn't seem likely to be a problem in practice, since we should be able to have a reasonable number of open file descriptors without issue, but it may be worth noting.

I suspect trying to compact the files would be overengineering things but it may be worth thinking about whether this could be a problem.

Also, this requires that the filesystem supports hole punching, right? IIRC there are some that don't.
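
For reference, the underlying Linux primitive (which a later comment in this review notes is encapsulated by Kudu's RWFile) looks roughly like this; it deallocates the byte range while keeping the logical file size, which is what makes the backing file sparse:

    #define _GNU_SOURCE
    #include <fcntl.h>

    // Deallocates [offset, offset + len) without changing the file's logical
    // size, leaving a hole; fails (e.g. EOPNOTSUPP) on filesystems that lack
    // hole-punching support, such as ext3.
    bool PunchHole(int fd, off_t offset, off_t len) {
      return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                       offset, len) == 0;
    }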


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55
PS1, Line 55: /// created instead. Note that due to hole punching, the backing file is actually sparse.
This might be a case where a simple ASCII diagram would illustrate the concept nicely (although the text is pretty clear on its own).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63
PS1, Line 63: /// happens inline currently during Store().
It's worth documenting the PAGE_SIZE behaviour since it implies that there's a minimum granularity of data that clients of the cache should be caching.
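
A sketch of the granularity implication (the header quoted elsewhere in this review states data is rounded up to the nearest multiple of 4KB): every insert is charged in page units, so tiny buffers still consume a full page of quota.

    #include <cstdint>

    static constexpr int64_t PAGE_SIZE = 4096;

    // Storage charged against the partition quota for a buffer of 'len' bytes:
    // rounded up to the next 4KB page, so a 1-byte entry still costs 4096.
    int64_t ChargedBytes(int64_t len) {
      return (len + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    }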


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@105
PS1, Line 105:  
space


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
Is it interesting to document reasons why the content may not be installed?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the process exits.
I'm not sure how I feel about this: it could cause a lot of confusion having all this "hidden" disk usage. It seems like TmpFileMgr's approach of deleting the existing scratch on startup has worked out OK, so it might be worth considering.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@158
PS1, Line 158:     // TODO: handle cases in which 'store_len' is longer than existing entry;
It's probably worth documenting this behaviour in the header; it's a little surprising. I guess we're really hoping that we don't have these collisions too often, since we could end up in a situation where we repeatedly miss the cache for the latter part of a data range and never actually store the data.

Or does the partial read handling in the I/O manager avoid this?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
> Have you considered making this fatal?
Should log the error message at least.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@110
PS1, Line 110:     DataCache* remote_data_cache = io_mgr->remote_data_cache();
I feel like this should be factored out into its own function so that the flow of the main function is clearer.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193
PS1, Line 193:       remote_data_cache->Store(*scan_range_->file_string(), scan_range_->mtime(),
> I think this will not be effective if the cache already has a smaller entry
I had the same question. I think it sort of works if you have a cache entry for [n, m) and a reader is reading [n, k), where k > m: it will get a hit for the cache entry, then read [m, k) and insert that into the cache. Then all subsequent readers of [n, k) will get two cache hits. It could get weird with overlapping cache entries if the initial offsets don't line up, but in practice it's probably not a big problem.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: Testing done: a new BE test was added; core test with cache enabled.
> I have a bit of trouble in getting this to work in the mini-cluster. May be
Ended up adding a startup flag to force use of the cache even for local reads. A new custom cluster test was also added as a sanity check.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2841/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 5:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: '--data_cache_dir' specifies the base directory in which each Impalad
> Can we add a custom cluster test to sanity check that it works end to end. 
I had a bit of trouble getting this to work in the mini-cluster. Maybe it's easier with the dockerised test.


http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62
PS4, Line 62: creates the caching directory
> Do we want to consider enabling this by default for the dockerised tests as
Definitely. Will also do so for S3 builds.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365
PS4, Line 365:       "DataCacheHitCount", TUnit::UNIT);
> Do we have tests for these to make sure they show up in the profiles and wo
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138
PS4, Line 138: 
> Move to the other test constants (TEMP_BUFFER_SIZE etc)?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226
PS4, Line 226: 
> nit: This could now be 4 * FLAGS_data_cache_file_max_size ?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349
PS4, Line 349:  num_entries = 0
> Can they be to separate tests?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410
PS4, Line 410: mp_buf
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83
PS4, Line 83: eviction from it happen s
> That's controlled by --data_cache_write_concurrency, right? Mention here?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87
PS4, Line 87: /// of 4KB so any data inserted will be rounded up to the nearest multiple of 4KB.
> Do we plan to look into picking partitions on faster disks with higher prob
Ideally, we want to keep the hotter data in the faster media while keeping the lukewarm data in the slower media. Added a TODO.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: ///
> Yeah this scenario is a bit concerning for me still since it's conceivable 
I added a "simple" implementation with rw-lock and lazy cache entry eviction. If it's deemed too complicated, please let me know and I can undo it. Also added a test case for it.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145
PS4, Line 145: ,
> nit: formatting
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197
PS4, Line 197:     /// - removes any stale backing file in this partition
> Should we also delete the files when we close them? There's a distinction i
This was needed for data-cache-test.cc, as we need to close the files before verifying their sizes. However, it seems that we can hide all those internal details in VerifyFileSizes(), which is renamed to CloseAndVerifyFileSizes().


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203
PS4, Line 203: 
> Should we pass const CacheKey& here and convert it in the implementation?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224
PS4, Line 224:  void EvictedEnt
> nit: VerifyFileSizes
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72:     "(Advanced) Enable checksumming for the cached buffer.");
> static const char*?
This is actually a static class member of DataCache.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75
PS4, Line 75: namespace io {
> Should this be a class, given it has a c'tor and d'tor?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173
PS4, Line 173: ock(lock_.get_lock());
             :     if (UNLIKELY(!file_)) return fals
> I think you can merge these two lines, which also reduces the risk that som
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: inline
> nit: missing include, but we might generally omit this one.
Not sure which one you are referring to? Isn't it in #include "common/names.h"?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@186
PS4, Line 186: // already closed.
             :   inline bool Write(int64_t offset, const uint8_t* buffer, int64_t buffer_len) {
             :     DCHECK_EQ(offset % PAGE_SIZE, 0);
             :     DCHECK_LE(offset, current_offset_);
             :     kudu::shared_lock<rw_spinlock> lock(lock_.get_lock());
             :     if (UNLIKELY(!file_)) return false;
             :     kudu::Status status = file_->Write(offset, Slice(buffer, buffer_len));
             :     if (UNLIKELY(!status.ok())) {
             :       LOG(ERROR) << Substitute("Failed to write to $0 at offset $1 for $2 bytes: $3",
             :           path_, offset, PrettyPrinter::PrintBytes(buffer_len), status.ToString());
             :       return false;
             :     }
             :     return true;
             :   }
> This might be a good candidate for a separate method, but I don't feel stro
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@235
PS4, Line 235: ains the w
> Wrap in PrintBytes, or just put "bytes" into the message
Added bytes to the message; it makes it easier to debug in case things go wrong.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@306
PS4, Line 306: Status DataCache::Partition::Init()
This check is racy. More allocation could have happened since 'insertion_offset' was allocated.
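
The race can be illustrated with a sketch (generic C++; this is not the patch's actual fix): if the offset allocation itself is atomic, each concurrent writer reasons only about its own disjoint range, and a capacity check against that range cannot be invalidated by later allocations.

    #include <atomic>
    #include <cstdint>

    std::atomic<int64_t> current_offset{0};

    // fetch_add hands each writer a disjoint [offset, offset + len) range, so
    // the "does it fit?" decision is made against this writer's own range and
    // cannot be invalidated by allocations that race in afterwards.
    int64_t AllocateRange(int64_t len, int64_t max_file_size, bool* fits) {
      int64_t offset = current_offset.fetch_add(len);
      *fits = offset + len <= max_file_size;
      return offset;
    }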


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@316
PS4, Line 316:   // Check if there is enough space available at this point in time.
> nit: can be on a single line
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@391
PS4, Line 391: 
> What are the cases where this loop runs more than once? Would a single if-s
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@402
PS4, Line 402:     kudu::Cache::Handle* handle, const uint8_t* buffer, int64_t buffer_len) {
> Does this need to go to line 387 before the return statement in the file lo
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@459
PS4, Line 459: 
> Is a capacity of 1 really supported? Should we require something that techn
Yes. It now requires at least 1 page.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@482
PS4, Line 482:     // the extra memory consumption for holding the temporary buffer though.
> Is this a valid thing to happen or should we DCHECK?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/disk-io-mgr.cc@61
PS4, Line 61: must h
> "must have"? Such a configuration is not supported anyways, but if people t
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/hdfs-file-reader.cc@153
PS4, Line 153:         RETURN_IF_ERROR(io_mgr->ReopenCachedHdfsFileHandle(hdfs_fs_,
> VLOG_FILE?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util-test.cc
File be/src/util/filesystem-util-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util-test.cc@147
PS4, Line 147: file type
> file type entries?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util.cc
File be/src/util/filesystem-util.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/util/filesystem-util.cc@339
PS4, Line 339:   const off_t hole_size = buffer_size * 2;
> Kudu's RWFile encapsulates this. Should we use it instead?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/4/bin/start-impala-cluster.py@118
PS4, Line 118: 0,
> "_size" implies that the default should be something like 0?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/common/thrift/metrics.json
File common/thrift/metrics.json:

http://gerrit.cloudera.org:8080/#/c/12987/4/common/thrift/metrics.json@433
PS4, Line 433:     "description": "Total number of insertion skipped in remote data cache due to concurrency limit.",
> Do we have a test for the new metrics to make sure they work as expected?
Done




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3051/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296
PS2, Line 296:     if (pending_insert_set_.size() >= FLAGS_data_cache_write_concurrency ||
             :         pending_insert_set_.find(key.ToString()) != pending_insert_set_.end()) {
             :       return false;
             :     }
TODO: Add a metric for this.
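
In other words, the quoted check implements an admission guard roughly like the following sketch (the caller is assumed to hold the partition lock; names are simplified): at most a fixed number of concurrent inserts per partition, and never two in-flight inserts for the same key.

    #include <cstddef>
    #include <set>
    #include <string>

    // Caller is assumed to hold the partition lock. Returns true and records
    // the key if this insert may proceed; the caller must erase the key from
    // 'pending' once the insert finishes. Callers that get false skip caching.
    bool TryBeginInsert(std::set<std::string>* pending, const std::string& key,
                        std::size_t max_concurrent_writers) {
      if (pending->size() >= max_concurrent_writers || pending->count(key) > 0) {
        return false;  // Too many writers, or this key is already in flight.
      }
      pending->insert(key);
      return true;
    }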




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3021/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.



[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

(23 comments)

I think it would be good to have some tests that make sure that data gets freed up on the disk by the hole punching mechanism as expected. Additionally, maybe a test to make sure that mtime and other parts of the key are properly considered would be nice to have.
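
One way such a test could observe the hole punching (a sketch, not the test the patch actually adds): compare st_blocks, the physically allocated size, against st_size, the logical size, before and after eviction.

    #include <cstdint>
    #include <sys/stat.h>

    // Physical bytes a file occupies on disk; st_blocks is in 512-byte units,
    // so for a sparse file this is smaller than the logical st_size.
    int64_t AllocatedBytes(const char* path) {
      struct stat st;
      if (stat(path, &st) != 0) return -1;
      return static_cast<int64_t>(st.st_blocks) * 512;
    }

    // A test could assert that AllocatedBytes(backing_file) shrinks by roughly
    // the evicted entry's size after eviction, while the logical size does not.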

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52
PS1, Line 52:   // The callback is invoked by each thread in the multi-threaded tests below.
"callback" reads like it's called when something is done; how about ThreadFn?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157
PS1, Line 157:   // Read the same same range inserted previously and they should still all in the cache.
nit: be


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196
PS1, Line 196:   // Create a buffer way larger than the cache.
Why does it need to be larger?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201
PS1, Line 201:   ASSERT_TRUE(cache.Store("foobar", 12345, 0, large_buffer.get(), cache_size));
Use variables instead of inline constants?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220
PS1, Line 220:   const int64_t cache_size = 1 << 22;
I find these easier to read in the form of 4 * 1024 * 1024


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248
PS1, Line 248: differen
typo


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301
PS1, Line 301: ootprint) {
Can you add a comment on what this test does?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330: int main(int argc, char **argv) {
This is not needed anymore if you add your test to the unified backend tests (but better confirm with Joe).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
We should mention in the comment how lookup works, in particular what happens when we look up data that is a sub-piece of something in the cache, where the lookup by offset wouldn't work (i.e. state that only exact matches are implemented).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
Maybe point out explicitly that it returns false for errors and if the content was already in the cache?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@129
PS1, Line 129:     /// Creates a partition at the given directory 'path' with quota 'capacity'.
Mention that capacity is bytes? Maybe even add a suffix that indicates it's bytes (and not number of entries, here and elsewhere)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@157
PS1, Line 157:     void EvictedEntry(kudu::Slice key, kudu::Slice value) override;
Can you add the virtual and override keywords to all virtual methods to make it obvious which ones are part of the interface?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@226
PS1, Line 226:     /// failure. Please note that the backing file is unlinked right after it's created
How will these files look to an administrator? Will this make it harder to debug out-of-space issues?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239
PS1, Line 239:     /// Please note that this function will CHECK fail if there is any checksum mismatch.
Does that mean "crash a release build"?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:       std::unique_ptr<kudu::faststring>* key);
You could return the pointer by value


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@66
PS1, Line 66: DRAM_CACHE
Should we add DISK_CACHE here? Otherwise, can you add a comment to explain why we use DRAM_CACHE?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@193
PS1, Line 193:   ssize_t bytes_written = 0;
declare right before usage


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222
PS1, Line 222: done:
You could use a scoped cleanup lambda and returns instead of goto
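
For example, a scope guard along these lines (Impala and Kudu have similar cleanup helpers; this sketch is not the exact API) lets each error path simply return:

    #include <functional>
    #include <utility>

    class ScopedCleanup {
     public:
      explicit ScopedCleanup(std::function<void()> fn) : fn_(std::move(fn)) {}
      ~ScopedCleanup() { if (fn_) fn_(); }  // Runs on every return path.
      void Cancel() { fn_ = nullptr; }      // For the success path, if needed.
      ScopedCleanup(const ScopedCleanup&) = delete;
      ScopedCleanup& operator=(const ScopedCleanup&) = delete;
     private:
      std::function<void()> fn_;
    };

    // Usage: replace 'goto done' with early returns.
    //   ScopedCleanup cleanup([&] { ReleaseBuffers(); });
    //   if (!status.ok()) return false;  // cleanup still runs here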


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@262
PS1, Line 262:   for (const string& cache_config : all_cache_configs) {
gutil has SplitStringIntoKeyValuePairs() in strings.h that could be helpful here.
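
For illustration, a hand-rolled parse of the config format from the commit message (a sketch only; the gutil helper above, or existing parsing utilities, would likely be preferable in real code):

    #include <cstddef>
    #include <string>
    #include <vector>

    // Splits "/data/0,/data/1:150GB" into {"/data/0", "/data/1"} and "150GB".
    bool ParseDataCacheConfig(const std::string& config,
                              std::vector<std::string>* dirs,
                              std::string* capacity_str) {
      std::size_t colon = config.rfind(':');
      if (colon == std::string::npos || colon + 1 == config.size()) return false;
      *capacity_str = config.substr(colon + 1);
      std::string dir_list = config.substr(0, colon);
      std::size_t start = 0;
      while (start <= dir_list.size()) {
        std::size_t comma = dir_list.find(',', start);
        if (comma == std::string::npos) comma = dir_list.size();
        if (comma > start) dirs->push_back(dir_list.substr(start, comma - start));
        start = comma + 1;
      }
      return !dirs->empty();
    }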


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:       LOG(ERROR) << "Failed to initialize data cache.";
Have you considered making this fatal?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h
File be/src/runtime/io/hdfs-file-reader.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.h@48
PS1, Line 48:   virtual void CachedFile(uint8_t** data, int64_t* length) override;
Maybe rename to ReadHdfsCachedFile to make its purpose more clear at the call-sites?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@124
PS1, Line 124:         if (exclusive_hdfs_fh_ != nullptr) {
I think a comment here would help explain why we do this


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193
PS1, Line 193:       remote_data_cache->Store(*scan_range_->file_string(), scan_range_->mtime(),
I think this will not be effective if the cache already has a smaller entry. Should we remove the smaller entry before adding the larger one? In either case it's probably good to add a comment that states the intent.

It'd probably be good to document here, or in the comment of Store(), why it's safe to ignore the return value.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of directories separated
by commas, followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with at most 150GB each for /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that reads from the
cache are consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB in a 20-node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB.
The performance is at parity with an HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,059 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/5

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 10: Code-Review+2

(1 comment)

Carry Todd's +2

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260
PS9, Line 260:   explicit CacheEntry(const Slice& value) {
> explicit :)
Done




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Michael Ho has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 8:

(5 comments)

Accidentally pushed a draft. Will push PS9.

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274
PS7, Line 274: 
> I think gflags has a built in 'google::FlagsSaver', no? It handles restorin
Nice. I don't think we have an equivalent of that in Impala, although it would be a nice improvement. Filed IMPALA-8480.


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245
PS7, Line 245:   explicit CacheFile(std::string path) : path_(move(path)) { }
> nit: explicit
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301
PS7, Line 301: 
> here I think you could just call lock_.DCheckLocked()  right?
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409
PS7, Line 409:     uint8_t* buffer) {
> why not have UnpackCacheEntry return the CacheEntry object? (the generated 
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h
File be/src/util/impalad-metrics.h:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91
PS7, Line 91: not ins
> In the rest of the code "skipped" often means "pruned" and is a good thing,
Done




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of directories separated
by commas, followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space in total, with a 150GB maximum for each of /data/0
and /data/1.
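
For illustration, a minimal sketch (not Impala's actual parser) of how
such a string could be split into per-directory quotas; ParseBytes() is
a hypothetical helper:

  #include <cstdint>
  #include <sstream>
  #include <string>
  #include <utility>
  #include <vector>

  int64_t ParseBytes(const std::string& s);  // hypothetical: "150GB" -> bytes

  // Returns one (directory, quota_bytes) pair per cache partition.
  std::vector<std::pair<std::string, int64_t>> ParseDataCacheConfig(
      const std::string& config) {
    std::vector<std::string> dirs;
    std::stringstream ss(config);
    for (std::string tok; std::getline(ss, tok, ','); ) dirs.push_back(tok);
    // The ":<capacity>" suffix on the last entry applies to every directory.
    size_t colon = dirs.back().rfind(':');
    int64_t quota = ParseBytes(dirs.back().substr(colon + 1));
    dirs.back().resize(colon);
    std::vector<std::pair<std::string, int64_t>> partitions;
    for (const std::string& d : dirs) partitions.emplace_back(d, quota);
    return partitions;
  }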

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file covering the range [m, m+4MB),
a lookup for [m+4K, m+8K) will miss the cache. In practice, we haven't
seen this be a problem, but it may require further evaluation in the
future.
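
As a rough sketch, the key and entry tuples above map to something like
the following (field names are illustrative, not the actual data-cache
types):

  #include <cstdint>
  #include <string>

  struct CacheKey {
    std::string filename;  // file's name
    int64_t mtime;         // file's modification time
    int64_t offset;        // offset of the cached range within the file
  };

  struct CacheEntry {
    int backing_file_idx;  // which backing file holds the data
    int64_t file_offset;   // offset within the backing file
    int64_t length;        // length of the cached data
    uint64_t checksum;     // optional checksum of the cached data
  };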

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.
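
A minimal sketch of the eviction path's hole punching, as described
above (Linux-specific; glibc needs _GNU_SOURCE for fallocate(), and
real code would check filesystem support and handle errors):

  #include <fcntl.h>

  // Frees the on-disk blocks backing [offset, offset + len) while
  // keeping the file's logical size unchanged, leaving it sparse.
  bool PunchHole(int fd, off_t offset, off_t len) {
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, len) == 0;
  }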

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.
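
As a thin sketch of the verify-on-read idea (ComputeChecksum() and the
CacheEntry fields are illustrative, not the actual implementation):

  #include <cstdint>

  uint64_t ComputeChecksum(const void* buf, int64_t len);  // hypothetical

  // On a cache hit, recompute over the bytes read back and compare
  // against the checksum stored at insertion time.
  bool VerifyChecksum(const CacheEntry& entry, const void* buffer) {
    return ComputeChecksum(buffer, entry.length) == entry.checksum;
  }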

To probe for cached data, the interface Lookup() is used; to insert
data into the cache, the interface Store() is used. Please note that
eviction currently happens inline during Store().
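
As a hedged usage sketch (the signatures here are illustrative, not
the exact DataCache API):

  // On a scan-range read: probe the cache first, fall back to the
  // remote filesystem on a miss, then store for future readers.
  int64_t hit_len = cache->Lookup(filename, mtime, offset, len, buffer);
  if (hit_len < len) {
    ReadFromRemote(filename, offset, len, buffer);  // hypothetical helper
    // Eviction may happen inline here if the partition is at capacity.
    cache->Store(filename, mtime, offset, buffer, len);
  }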

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory; '--data_cache_size' specifies the
capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a cache size of 150GB
per node. The performance is at parity with an HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Reviewed-on: http://gerrit.cloudera.org:8080/12987
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 13
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 1:

> (1 comment)
 > 
 > > I think the _IO_ apportioned relative to capacity may be fine as
 > that's a reasonable expectation if a user specifies more capacity
 > on a slower device. One thing to consider may be to use multiple
 > backing files for larger partition to avoid per-file lock problem.
 > 
 > I think, unfortunately, the lower capacity device is most likely to
 > be the faster one (eg a small SSD plus big HDDs).
 > 
 > Maybe we can simplify this for now by just requiring that all
 > partitions of the cache be allocated the same amount of space? The
 > most immediate scenario I can see is on Amazon instances like
 > r5d.4xlarge (two local SSDs with the same size) where you'd want to
 > give the same amount of capacity to each cache drive anyway.
 > 
 > Put another way, I think we should constrain the configuration
 > space in such a way that only good configs are possible, rather
 > than hope that users don't do something like: /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB
 > and find that their SSD's fast IO performance is basically ignored.

Is it important that the caches fill up at the same ratio (e.g. all disks at 20%)? Or can we pick each disk with probability proportional to its throughput, to optimize for combined bandwidth? Then the SSD would fill up first, with 6x more data on the SSD than on each of the HDDs (for a 6x throughput ratio). We could then either adjust that by a constant factor through experiments, or add an expert-level option to let admins specify the throughput ratio between the disks.
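
For illustration, throughput-weighted selection could be as simple as
the following sketch (the weights are illustrative knobs, not existing
flags):

  #include <random>
  #include <vector>

  // Picks a cache partition with probability proportional to its
  // throughput weight, e.g. {6, 1, 1, 1} for one SSD at roughly 6x
  // the throughput of three HDDs.
  int PickPartition(const std::vector<double>& weights, std::mt19937& rng) {
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
  }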


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 12 Apr 2019 16:10:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Michael Ho (Code Review)" <ge...@cloudera.org>.
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim Armstrong, Joe McDonnell, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#2).

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space in total, with a 150GB maximum for each of /data/0
and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry for a file covering the range [m, m+4MB),
a lookup for [m+4K, m+8K) will miss the cache. In practice, we haven't
seen this be a problem, but it may require further evaluation in the
future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.

To probe for cached data, the interface Lookup() is used; to insert
data into the cache, the interface Store() is used. Please note that
eviction currently happens inline during Store().

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory; '--data_cache_size' specifies the
capacity string for each cache directory.

Testing done: a new BE test was added; core test with cache enabled.

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a cache size of 150GB
per node. The performance is at parity with an HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
17 files changed, 1,467 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/2948/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: David Rorke <dr...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Michael Ho <kw...@cloudera.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Reviewer: Thomas Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 27 Apr 2019 08:06:04 +0000
Gerrit-HasComments: No