You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2017/04/05 19:23:28 UTC
[Impala-ASF-CR] IMPALA-4029: Reduce memory requirements for storing file metadata

Alex Behm has posted comments on this change.

Change subject: IMPALA-4029: Reduce memory requirements for storing file metadata
......................................................................


Patch Set 3:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/6406/3/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

Line 36:   offset: long = 0 (id: 0);
Why a default value? Seems potentially dangerous.

Not for this change, but I'm thinking we don't even need to store the offset. If we know the block size, than we can derive the offset (assuming the list of file blocks is ordered by offset). Might be worth adding a TODO or recording that idea somewhere.


Line 40:   length: long = -1 (id: 1);
Seems redundant, why keep it?


Line 65:   compression: FbCompression (id: 3);
Will FlatBuffers add padding to align members? Ideally, we'd optimize for space and not access speed.


http://gerrit.cloudera.org:8080/#/c/6406/3/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

Line 218:   1: required binary file_desc_data
Nice


Line 296:   10: optional list<string> file_name_prefixes
Is this required for the move to flat buffers? I think we should consider an HdfsPathSet abstraction that assigns path ids and internally compresses the underlying strings. There's a lot of manual lookups and stitching in the current code. I don't feel too strongly about whether we should do that now, or clean up the code later.


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
File fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java:

Line 78:   public byte toFbCompression() {
toFlatBuffer()?


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
File fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java:

Line 126:         locations = fileSystem.getFileBlockLocations(fileStatus, 0, fileStatus.getLen());
I think it would be better for now to keep all the code that fetches information from external systems in one place (HdfsTable). Splitting up the loading and delegating to several classes may make sense, but that probably requires significant surgery, and the current fetching code is very much centralized (we iterate over all files of in a table).

The loading code in this patch is more confusing to me than before. The meaning of some verbs like load/create is less clear.

If you agree with that direction, we may not need a FileDescriptor class at all, and can only rely on the FB to hold the data. It may still make sense to have a FileDescBuilder which you can use to construct a FbFileDesc.


Line 227:             loc.getNames().length);
Another case where we might be calling out to the NN.


Line 321:     private static int REPLICA_HOST_CACHE_MASK = 0x8000;
Less code and more readable to have one var with (1 << 15). The places that need to & the mask can bit-wise invert, i.e. (x & ~MASK).


http://gerrit.cloudera.org:8080/#/c/6406/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 189:   // List of file name prefixes.
Sorted list of file name prefixes?


Line 309:         Pair<Integer, String> compressedFileName =
How predictable is the compression behavior? Do we iterate over the files in lexicographical order for both HDFS and S3?
I'm worried about the case where an "invalidate metadata" suddenly leads to a higher memory requirement even if no files/partitions have changed.


Line 340:    * the suffix is equal to 'fileName'.
Should mention that this function expects the fileNames to be sorted.


Line 346:     String commonPrefix = Strings.commonPrefix(prevFileName, fileName);
I'm wondering if it's better to first check the last common prefix instead of the prefix between the current and prev file name. In the current version it seems like the list of prefixes could contain strings which are a common prefix of another one.


-- 
To view, visit http://gerrit.cloudera.org:8080/6406
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I483d3cadc9d459f71a310c35a130d073597b0983
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes