Posted to reviews@impala.apache.org by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org> on 2019/02/04 06:01:28 UTC
[Impala-ASF-CR] IMPALA-7540. Intern most repetitive strings and network addresses in catalog
Bharath Vissapragada has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11158 )
Change subject: IMPALA-7540. Intern most repetitive strings and network addresses in catalog
......................................................................
IMPALA-7540. Intern most repetitive strings and network addresses in catalog
This change interns a number of frequently repeated strings in catalog
objects, including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc.)
- HBase column family names
Additionally, it interns TNetworkAddress objects, so that each datanode
host is stored only once rather than having its own copy in each table.
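As an illustration of the technique described above, here is a minimal
sketch of interning. It is not the CatalogInterners class added by this
change: the names InternSketch, Interner, and HostPort are hypothetical,
and HostPort is only a stand-in for the Thrift-generated TNetworkAddress.
The idea is that equal values are mapped to one canonical instance, so
every table or partition referring to "127.0.0.1" shares a single object.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch; not the CatalogInterners class from this change. */
final class InternSketch {
  /** Minimal thread-safe interner: one canonical instance per equal value. */
  static final class Interner<T> {
    private final ConcurrentMap<T, T> pool = new ConcurrentHashMap<>();

    T intern(T value) {
      if (value == null) return null;
      // putIfAbsent returns the previously stored instance, or null if
      // this value is the first of its kind and was just stored.
      T existing = pool.putIfAbsent(value, value);
      return existing != null ? existing : value;
    }

    int size() { return pool.size(); }
  }

  /** Assumed stand-in for TNetworkAddress: immutable, with value equality. */
  static final class HostPort {
    final String host;
    final int port;

    HostPort(String host, int port) {
      this.host = Objects.requireNonNull(host);
      this.port = port;
    }

    @Override public boolean equals(Object o) {
      if (!(o instanceof HostPort)) return false;
      HostPort other = (HostPort) o;
      return port == other.port && host.equals(other.host);
    }

    @Override public int hashCode() { return Objects.hash(host, port); }
  }

  static final Interner<String> STRINGS = new Interner<>();
  static final Interner<HostPort> ADDRESSES = new Interner<>();

  public static void main(String[] args) {
    // Two distinct String objects with equal contents collapse to one.
    String a = STRINGS.intern(new String("transient_lastDdlTime"));
    String b = STRINGS.intern(new String("transient_lastDdlTime"));
    System.out.println(a == b); // prints "true"

    // Two tables referring to the same datanode share one address object.
    HostPort h1 = ADDRESSES.intern(new HostPort("127.0.0.1", 50010));
    HostPort h2 = ADDRESSES.intern(new HostPort("127.0.0.1", 50010));
    System.out.println(h1 == h2);         // prints "true"
    System.out.println(ADDRESSES.size()); // prints "1"
  }
}
```

This sketch holds strong references, so pooled values are never released.
A long-running process would more likely want a weak interner (for
example, Guava's Interners.newWeakInterner()) so that values no longer
referenced by any catalog object can be garbage collected. Interned
objects must also be treated as immutable, since they are shared.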
I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:
Overhead      # char[]s  # objects  Value
164K (0.3%)       2,635      2,635  "127.0.0.1"
97K (0.2%)        1,038      1,038  "__HIVE_DEFAULT_PARTITION__"
95K (0.2%)        1,111      1,111  "transient_lastDdlTime"
92K (0.1%)        1,975      1,975  "d"
70K (0.1%)          997        997  "EXTERNAL_TABLE"
56K (< 0.1%)      1,201      1,201  "todd"
54K (< 0.1%)        998        998  "EXTERNAL"
46K (< 0.1%)        998        998  "TRUE"
44K (< 0.1%)        567        567  "numFilesErasureCoded"
38K (< 0.1%)        612        612  "totalSize"
30K (< 0.1%)        567        567  "numFiles"
The following entry is reduced by roughly a third (1,543 copies down to
1,009):
Before: 72K (0.1%)        1,543      1,543  "1"
After:  47K (< 0.1%)      1,009      1,009  "1"
A few large strings remain in the report that may be worth addressing, depending
on whether we think production catalogs exhibit the same repetitions:
1) Avro schemas, e.g.:
204K (0.3%) 3 3 "{"fields": [{"type": ["boolean", "null"], "name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"}, {"type": ...[length 52429]"
(in the development catalog there are multiple tables with the same Avro
schema)
2) Partition location suffixes, e.g.:
144K (0.2%)       1,234      1,234  "many_blocks_num_blocks_per_partition_1"
17K (< 0.1%)        230        230  "year=2009/month=2"
17K (< 0.1%)        230        230  "year=2009/month=3"
17K (< 0.1%)        230        230  "year=2009/month=1"
(in the development catalog lots of tables have the same partitioning
layout)
3) Strings of unknown origin (jxray does not report the reference chain,
but these seem likely to be partition values):
49K (< 0.1%)      1,058      1,058  "2010"
28K (< 0.1%)        612        612  "2009"
27K (< 0.1%)        585        585  "0"
22K (< 0.1%)         71        899  ""
Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Reviewed-on: http://gerrit.cloudera.org:8080/11158
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
A fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
M fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
5 files changed, 250 insertions(+), 6 deletions(-)
Approvals:
Impala Public Jenkins: Looks good to me, approved; Verified
--
To view, visit http://gerrit.cloudera.org:8080/11158
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Gerrit-Change-Number: 11158
Gerrit-PatchSet: 7
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>