Posted to commits@impala.apache.org by jo...@apache.org on 2019/02/06 05:18:24 UTC

[impala] 05/08: test-with-docker: decrease image size by "de-duping" HDFS.

This is an automated email from the ASF dual-hosted git repository.

joemcdonnell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/impala.git

commit 2f5d0016eaa3f9e0e4e3b02032d092cd7b887198
Author: Philip Zeyliger <ph...@cloudera.com>
AuthorDate: Tue Oct 23 09:53:54 2018 -0700

    test-with-docker: decrease image size by "de-duping" HDFS.
    
    This change shaves about 20GB off the (uncompressed) Docker
    image for test-with-docker, taking it from ~60GB to ~40GB.
    Compressed, the image ends up being about 14GB.
    
    To do this, we cheat: with three datanodes and a replication factor of
    three, HDFS stores three identical copies of every block. Before
    committing the image, we simply hard-link those copies together, which
    happens to work. We are relying on the implementation detail that these
    block files aren't later modified (e.g., appended to) in place, but I
    think the trade-off in time and disk space saved is worth it.
    Because the image is smaller, it takes less time to "docker commit" it.
    
    Change-Id: I4a13910ba5e873c31893dbb810a8410547adb2f1
    Reviewed-on: http://gerrit.cloudera.org:8080/11782
    Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
    Tested-by: Impala Public Jenkins <im...@cloudera.com>
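
The saving comes entirely from hard links: each of the three datanode
directories holds a byte-identical copy of every block file, and linking
them makes the filesystem store the bytes once while keeping three
directory entries. A toy sketch of that effect, using throwaway files
rather than the real HDFS block paths:

    # Three identical files stored separately, then collapsed with hard links.
    tmpdir=$(mktemp -d)
    dd if=/dev/zero of="$tmpdir/replica1" bs=1M count=64 status=none
    cp "$tmpdir/replica1" "$tmpdir/replica2"    # stand-in for the node-2 copy
    cp "$tmpdir/replica1" "$tmpdir/replica3"    # stand-in for the node-3 copy
    du -sh "$tmpdir"                            # ~192M: three separate copies
    ln -f "$tmpdir/replica1" "$tmpdir/replica2"
    ln -f "$tmpdir/replica1" "$tmpdir/replica3"
    du -sh "$tmpdir"                            # ~64M: one copy, three names
    rm -r "$tmpdir"
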
---
 docker/entrypoint.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/docker/entrypoint.sh b/docker/entrypoint.sh
index c4a1243..3f56252 100755
--- a/docker/entrypoint.sh
+++ b/docker/entrypoint.sh
@@ -252,6 +252,24 @@ function build_impdev() {
   # Shut down things cleanly.
   testdata/bin/kill-all.sh
 
+  # "Compress" HDFS data by de-duplicating blocks. As a result of
+  # having three datanodes, our data load is 3x larger than it needs
+  # to be. To alleviate this (to the tune of ~20GB savings), we
+  # use hardlinks to link together the identical blocks. This is absolutely
+  # taking advantage of an implementation detail of HDFS.
+  echo "Hardlinking duplicate HDFS block data."
+  set +x
+  for x in $(find testdata/cluster/*/node-1/data/dfs/dn/current/ -name 'blk_*[0-9]'); do
+    for n in 2 3; do
+      xn=${x/node-1/node-$n}
+      if [ -f "$xn" ]; then
+        rm "$xn"
+        ln "$x" "$xn"
+      fi
+    done
+  done
+  set -x
+
   # Shutting down PostgreSQL nicely speeds up its start time for new containers.
   _pg_ctl stop
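
A hypothetical sanity check (not part of the patch) that the de-duplication
took effect: block files that had replicas on all three datanodes should now
report a hard-link count of 3, and du, which counts each inode only once per
invocation, should show the three datanode directories occupying roughly the
space of one.

    # Run from the Impala checkout inside the container, after the loop above.
    # Count block files whose replicas were collapsed into hard links (>2 links).
    find testdata/cluster/*/node-1/data/dfs/dn/current/ -name 'blk_*[0-9]' -links +2 | wc -l
    # du counts each hard-linked block once, so the combined size of the three
    # datanode directories should be close to the size of a single one.
    du -shc testdata/cluster/*/node-*/data/dfs/dn/current/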