Posted to reviews@impala.apache.org by "Paul Rogers (Code Review)" <ge...@cloudera.org> on 2019/01/26 05:41:17 UTC

[Impala-ASF-CR] IMPALA-7540. Intern most repetitive strings and network addresses in catalog

Paul Rogers has uploaded a new patch set (#3) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/11158 )

Change subject: IMPALA-7540. Intern most repetitive strings and network addresses in catalog
......................................................................

IMPALA-7540. Intern most repetitive strings and network addresses in catalog

This adds interning to strings that are frequently repeated across catalog
objects, including:
- table name
- DB name
- owner
- column names
- input/output formats
- parameter keys
- common parameter values ("true", "false", etc.)
- HBase column family names

Additionally, it interns TNetworkAddresses, so that each datanode host
is only stored once rather than having its own copy in each table.
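
As a rough illustration of the idea (not the code in this patch), interning
maps every equal value to one canonical instance, so N tables that mention
the same string or address share one object instead of holding N copies.
The sketch below is self-contained and makes several assumptions: the names
InternSketch, SimpleInterner and Address are hypothetical, Address is only a
stand-in for TNetworkAddress, and the pool holds strong references for
simplicity. A long-lived process would more likely want a weak-reference
pool (for example Guava's Interners.newWeakInterner()) so that unused
entries can be garbage collected.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical sketch of interning; not the CatalogInterners class from this patch. */
public class InternSketch {

  /** Minimal thread-safe interner: returns one canonical instance per equal value. */
  static final class SimpleInterner<T> {
    // Strong references keep the sketch short; entries are never evicted.
    private final ConcurrentMap<T, T> pool = new ConcurrentHashMap<>();

    T intern(T value) {
      if (value == null) return null;
      // putIfAbsent returns the previously pooled instance, or null if we just added it.
      T existing = pool.putIfAbsent(value, value);
      return existing != null ? existing : value;
    }

    int size() { return pool.size(); }
  }

  /** Hypothetical stand-in for TNetworkAddress: value equality on host and port. */
  static final class Address {
    final String hostname;
    final int port;

    Address(String hostname, int port) {
      this.hostname = hostname;
      this.port = port;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof Address)) return false;
      Address other = (Address) o;
      return port == other.port && Objects.equals(hostname, other.hostname);
    }

    @Override
    public int hashCode() { return Objects.hash(hostname, port); }
  }

  static final SimpleInterner<String> STRINGS = new SimpleInterner<>();
  static final SimpleInterner<Address> ADDRESSES = new SimpleInterner<>();

  public static void main(String[] args) {
    // Two distinct String objects with equal contents, as if loaded for two tables.
    String k1 = STRINGS.intern(new String("transient_lastDdlTime"));
    String k2 = STRINGS.intern(new String("transient_lastDdlTime"));
    System.out.println(k1 == k2); // true: both references point to the pooled instance

    // The port number here is arbitrary and only for illustration.
    Address a1 = ADDRESSES.intern(new Address("127.0.0.1", 12345));
    Address a2 = ADDRESSES.intern(new Address("127.0.0.1", 12345));
    System.out.println(a1 == a2); // true: one shared Address object
  }
}
```

Interning is only safe for immutable values (or values treated as immutable
after interning), since every holder shares the same instance.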

I verified this patch using jxray on the development catalogd and
impalad. The following lines are removed entirely from the "duplicate
strings" report:

 Overhead        # char[]s  # objects  Value
 164K (0.3%)     2,635      2,635      "127.0.0.1"
 97K (0.2%)      1,038      1,038      "__HIVE_DEFAULT_PARTITION__"
 95K (0.2%)      1,111      1,111      "transient_lastDdlTime"
 92K (0.1%)      1,975      1,975      "d"
 70K (0.1%)      997        997        "EXTERNAL_TABLE"
 56K (< 0.1%)    1,201      1,201      "todd"
 54K (< 0.1%)    998        998        "EXTERNAL"
 46K (< 0.1%)    998        998        "TRUE"
 44K (< 0.1%)    567        567        "numFilesErasureCoded"
 38K (< 0.1%)    612        612        "totalSize"
 30K (< 0.1%)    567        567        "numFiles"

The following are reduced substantially:

Before: 72K (0.1%)      1,543   1,543  "1"
After:  47K (< 0.1%)    1,009   1,009  "1"

A few large strings remain in the report that may be worth addressing, depending
on whether we think production catalogs exhibit the same repetitions:

1) Avro schemas, e.g.:
 204K (0.3%)     3       3      "{"fields": [{"type": ["boolean", "null"], "name": "bool_col1"}, {"type": ["int", "null"], "name": "tinyint_col1"}, {"type": ...[length 52429]"

(in the development catalog there are multiple tables with the same Avro
schema)

2) Partition location suffixes, e.g.:
 144K (0.2%)     1,234   1,234  "many_blocks_num_blocks_per_partition_1"
 17K (< 0.1%)    230     230    "year=2009/month=2"
 17K (< 0.1%)    230     230    "year=2009/month=3"
 17K (< 0.1%)    230     230    "year=2009/month=1"

(in the development catalog lots of tables have the same partitioning
layout)

3) Unsure (jxray isn't reporting the reference chain, but seems likely
   to be partition values):
 49K (< 0.1%)    1,058   1,058  "2010"
 28K (< 0.1%)    612     612    "2009"
 27K (< 0.1%)    585     585    "0"
 22K (< 0.1%)    71      899    ""

Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
---
A fe/src/main/java/org/apache/impala/catalog/CatalogInterners.java
M fe/src/main/java/org/apache/impala/catalog/HBaseColumn.java
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/catalog/Table.java
5 files changed, 250 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/58/11158/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11158
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib3121aefa4391bcb1477d9dba0a49440d7000d26
Gerrit-Change-Number: 11158
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Paul Rogers <pr...@cloudera.com>