Posted to commits@impala.apache.org by jo...@apache.org on 2019/02/06 05:18:24 UTC
[impala] 05/08: test-with-docker: decrease image size by "de-duping" HDFS.
This is an automated email from the ASF dual-hosted git repository.
joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git
commit 2f5d0016eaa3f9e0e4e3b02032d092cd7b887198
Author: Philip Zeyliger <ph...@cloudera.com>
AuthorDate: Tue Oct 23 09:53:54 2018 -0700
test-with-docker: decrease image size by "de-duping" HDFS.
This change shaves about 20GB off the (uncompressed) Docker
image for test-with-docker, taking it from ~60GB to ~40GB.
Compressed, the image ends up being about 14GB.
To do this, we cheat: HDFS replicates every block three times, so the
minicluster stores three identical copies of each block file. Before
committing the image, we simply hard-link the replicas together, which
happens to work. This relies on the implementation detail that HDFS
doesn't modify (say, append to) these block files in place, but I think
the trade-off in time and disk space saved is worth it.
Because the image is smaller, it takes less time to "docker commit" it.
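The hard-linking trick described above can be sketched in isolation. This is a
self-contained demo on a temporary directory mimicking three datanode replica
files (the paths and block name are illustrative, not the real HDFS layout;
`stat -c %h`, which prints a file's hard-link count, assumes GNU coreutils):

```shell
#!/bin/sh
# Demo: replace identical replica files with hard links so the data
# is stored once on disk while all three paths remain valid.
set -e
dir=$(mktemp -d)
mkdir -p "$dir/node-1" "$dir/node-2" "$dir/node-3"

# Simulate three datanodes each holding an identical 1KB block file.
head -c 1024 /dev/zero > "$dir/node-1/blk_100"
cp "$dir/node-1/blk_100" "$dir/node-2/blk_100"
cp "$dir/node-1/blk_100" "$dir/node-3/blk_100"

# Replace replicas 2 and 3 with hard links to replica 1.
for n in 2 3; do
  rm "$dir/node-$n/blk_100"
  ln "$dir/node-1/blk_100" "$dir/node-$n/blk_100"
done

# All three names now share one inode, so only one copy occupies disk.
links=$(stat -c %h "$dir/node-1/blk_100")
echo "link count: $links"
rm -r "$dir"
```

Because all names point at the same inode, a subsequent `docker commit` only
has to store the block data once.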
Change-Id: I4a13910ba5e873c31893dbb810a8410547adb2f1
Reviewed-on: http://gerrit.cloudera.org:8080/11782
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
docker/entrypoint.sh | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/docker/entrypoint.sh b/docker/entrypoint.sh
index c4a1243..3f56252 100755
--- a/docker/entrypoint.sh
+++ b/docker/entrypoint.sh
@@ -252,6 +252,24 @@ function build_impdev() {
# Shut down things cleanly.
testdata/bin/kill-all.sh
+ # "Compress" HDFS data by de-duplicating blocks. As a result of
+ # having three datanodes, our data load is 3x larger than it needs
+ # to be. To alleviate this (to the tune of ~20GB savings), we
+ # use hardlinks to link together the identical blocks. This is absolutely
+ # taking advantage of an implementation detail of HDFS.
+ echo "Hardlinking duplicate HDFS block data."
+ set +x
+ for x in $(find testdata/cluster/*/node-1/data/dfs/dn/current/ -name 'blk_*[0-9]'); do
+ for n in 2 3; do
+ xn=${x/node-1/node-$n}
+ if [ -f $xn ]; then
+ rm $xn
+ ln $x $xn
+ fi
+ done
+ done
+ set -x
+
# Shutting down PostgreSQL nicely speeds up its start time for new containers.
_pg_ctl stop
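The loop in the patch leaves `$xn` unquoted and links blindly, which is fine
for HDFS block filenames. For readers adapting the idea elsewhere, a more
defensive variant (hypothetical, not part of this commit) quotes its paths and
verifies with `cmp -s` that the replicas are byte-identical before linking.
The `dedupe_block` helper name and the demo layout below are made up for
illustration; `stat -c %h` assumes GNU coreutils:

```shell
#!/bin/sh
# Hypothetical hardened variant of the dedup loop: only replace a
# replica with a hard link if it is byte-identical to the node-1 copy.
set -e

dedupe_block() {
  x=$1
  for n in 2 3; do
    # Derive the sibling replica's path from the node-1 path.
    xn=$(printf '%s' "$x" | sed "s/node-1/node-$n/")
    # cmp -s exits 0 only if the files match byte for byte.
    if [ -f "$xn" ] && cmp -s "$x" "$xn"; then
      rm "$xn"
      ln "$x" "$xn"
    fi
  done
}

# Demo on a temporary tree mimicking the three-datanode layout.
dir=$(mktemp -d)
for n in 1 2 3; do
  mkdir -p "$dir/node-$n"
  printf 'blockdata' > "$dir/node-$n/blk_42"
done

dedupe_block "$dir/node-1/blk_42"

# After de-duplication all three paths share one inode (link count 3).
count=$(stat -c %h "$dir/node-1/blk_42")
echo "links: $count"
rm -r "$dir"
```

The `cmp -s` guard costs one extra read per block but protects against linking
replicas that diverged (for example, a partially written block on one node).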