Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2017/04/05 20:29:48 UTC

[Impala-ASF-CR] IMPALA-4623: Thread level file handle caching

Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4623: Thread level file handle caching
......................................................................


Patch Set 1:

(19 comments)

http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.cc
File be/src/runtime/disk-io-mgr.cc:

Line 70: DEFINE_uint64(max_cached_file_handles, 10000, "Maximum number of HDFS file handles "
> on further consideration, esp. in light of what tim brought up (separate re
Comment out of date.


Line 388:   std::vector<int> num_threads_per_disk;
> don't use std:: in .cc files
Done


Line 1050: void DiskIoMgr::WorkLoop(DiskQueue* disk_queue, int thread_file_handle_cache_size) {
> you use 'fh' elsewhere for 'file handle', feel free to use that elsewhere i
removed


Line 1052:   if (thread_file_handle_cache_size > 0) {
> dcheck that it's not < 0
removed


Line 1081:       Write(worker_context, static_cast<WriteRange*>(range));
> what about writing?
We currently don't do writes to HDFS via the IO manager. The writes here are to local files.

When we are writing tables, we open an Hdfs file handle in hdfs-table-sink.cc. I don't think those file handles can be reused, because we open them in write-only mode, while the file handle cache opens files in read-only mode. I don't see any way to convert between the two.


Line 1337:   : capacity_(capacity) {}
> move to .h
removed


Line 1339: DiskIoMgr::ThreadFileHandleCache::~ThreadFileHandleCache() {}
> remove
removed


Line 1343:   std::string file_key = std::string(fname) + std::to_string(mtime);
> is this guaranteed to be unique w/o some sort of separator?
removed


Line 1351:       EvictFileHandle();
> you only call this once, and it's a small function, you might as well inlin
removed


http://gerrit.cloudera.org:8080/#/c/6478/1/be/src/runtime/disk-io-mgr.h
File be/src/runtime/disk-io-mgr.h:

Line 233:   /// This is a single-threaded LRU cache for Hdfs file handles. The cache creates and
> I guess I was thinking that each thread would temporarily check out a file 
The approach changed to using a single cache at the DiskIoMgr level like the existing cache.


Line 235:   class ThreadFileHandleCache {
> 'thread' is generic. LruFileHandleCache? or simply FileHandleCache?
Removed.


Line 240:     /// Gets an Hdfs file handle for the specified file. It will lookup and use a
> "look up"
Removed


Line 245:     HdfsCachedFileHandle* GetFileHandle(const hdfsFS& fs, const char* fname, int64_t mtime);
> long line
Removed


Line 248: 
> remove blank line
Removed


Line 256:     class LruListElement {
> make it a struct, class member vars are supposed to be private.
Removed


Line 260:       ~LruListElement() {}
> remove
Removed


Line 533:     /// in the reader context.
> "reader context" = reader_ member. referencing the member directly is more 
N/A in new code


Line 544:     /// the DiskIoRequestContext. This clears the statistics on this file handle.
> why not make that a function of diskiorequestcontext
The two statistics that need input from the ScanRange are unexpected_remote_bytes_ and num_remote_ranges_.

unexpected_remote_bytes_ requires knowing the ScanRange's expected_local_ value. There is also some tracing that prints when a ScanRange has unexpected remote bytes (see ScanRange::Close for where it lives now). If we passed expected_local_ (or the ScanRange itself) to the function, that tracing is the only thing preventing this from moving directly into DiskIoRequestContext.

num_remote_ranges_ is affected by the fact that GetStatistics can be called multiple times per scan range, so it needs to know whether this range has already been counted at the DiskIoRequestContext level. That can be worked around.


Line 574:     /// at read time, so hdfs_file_ is null.
> is there a need to have it both ways for hdfs files, instead of always open
We need to keep the hdfs file handle open for as long as the cached buffer is in use. We also use this path when file handle caching is off (max_cached_file_handles=0).


-- 
To view, visit http://gerrit.cloudera.org:8080/6478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibe5ff60971dd653c3b6a0e13928cfa9fc59d078d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes