
Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2022/09/16 15:25:25 UTC

[GitHub] [accumulo-docker] keith-turner opened a new pull request, #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

keith-turner opened a new pull request, #20:
URL: https://github.com/apache/accumulo-docker/pull/20

This PR is intentionally incomplete. I am seeking to address a problem I see, but I am not sure this is the best approach.

In the past, when testing compactor and scan servers running in Kubernetes, I would go through the following process.

1. clone accumulo docker
2. manually download hadoop and zookeeper
3. build a snapshot version of accumulo
4. build an accumulo docker image
5. push the docker image to a container repository that the Kubernetes cluster can pull from
6. restart the accumulo processes running in Kubernetes
7. run some experiments and make some changes to accumulo and then go to step 3

Step 4 above takes multiple minutes and creates a 2GB image. Because the image is so large, steps 5 and 6 take a while as the image is uploaded to and then downloaded from the repo. This PR works around these problems by doing the following.

* Move the code that downloads the needed deps out of the docker build file. This saves me from manually downloading them in step 2 above.
* Split the docker build file into two build files. The first one builds a base image with java, hadoop, and zookeeper. The second extends the first and only has to include Accumulo.

With the above changes I can have the following workflow.

1. clone accumulo docker
2. run download script to get hadoop and zookeeper files
3. build the accumulo-base docker image that includes java, hadoop, and zk
4. build a snapshot version of accumulo
5. build the accumulo docker image that extends accumulo-base and includes accumulo
6. push the docker image to a container repository that the Kubernetes cluster can pull from
7. restart the accumulo processes running in Kubernetes
8. run some experiments and make some changes to accumulo and then go to step 4

Step 5 above takes a few seconds (vs a few minutes) and produces a new image where the layers on top of accumulo-base are only ~30MB (this can be seen with the docker history command). The first time steps 6 and 7 run, the large accumulo-base image has to be uploaded and downloaded. However, on subsequent runs of steps 6 and 7 only ~30MB needs to be uploaded and downloaded, making those steps much faster.
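
As a rough illustration of the second build file, a minimal sketch might look like the following. The image name `accumulo-base` comes from the description above. The `latest` tag, the `ACCUMULO_TARBALL` build argument and its default file name, and the `/opt/accumulo` install path are assumptions made for illustration and are not taken from this PR.

```dockerfile
# Hypothetical second build file. It extends the prebuilt base image, which is
# assumed to already contain java, hadoop, and zookeeper (and the tar utility).
FROM accumulo-base:latest

# Hypothetical name of the locally built snapshot tarball in the build context.
ARG ACCUMULO_TARBALL=accumulo-snapshot-bin.tar.gz

# Copy the snapshot in. Only the layers from this point down are rebuilt
# when a new snapshot is produced.
COPY ${ACCUMULO_TARBALL} /tmp/

# Unpack into an assumed install location.
RUN mkdir -p /opt/accumulo \
 && tar -xzf /tmp/${ACCUMULO_TARBALL} -C /opt/accumulo --strip-components=1
```

Because everything below the `FROM` line is specific to Accumulo, a rebuild after step 4 only has to produce these small layers, and a push or pull only has to transfer them once the base image is already present on the other side.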

This is a huge improvement for what I am trying to do. I did just enough work to get this functioning. Before updating the readme and improving the docker file and download script, I would like to see if anyone has feedback.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@accumulo.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo-docker] keith-turner commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

keith-turner commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1250970667

   > Simply splitting the RUN command into multiple run commands might achieve what you want.
   
   That might work. I was wondering how docker decides that it can reuse a cached layer, so I looked and found the following.
   
   https://medium.com/swlh/docker-caching-introduction-to-docker-layers-84f20c48060a
   https://stackoverflow.com/questions/60578670/why-does-docker-rebuild-all-layers-every-time-i-change-build-args
   
   Based on those, it seems that where accumulo is copied in will really matter. I am going to try the following order in the build file.
   
    1. copy hadoop and ZK in
    2. run to install jdk, other stuff, hadoop, and ZK
    3. copy accumulo in
    4. run to install accumulo
   
   So hopefully, if only accumulo changes, docker will only rerun steps 3 and 4.
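
   The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the base image, the JDK package name, the tarball names, and the `/opt` paths are all hypothetical and are not taken from the actual Dockerfile.

   ```dockerfile
   # Assumed base image and package manager; the real Dockerfile may differ.
   FROM ubuntu:22.04

   # Hypothetical tarball names, expected in the build context.
   ARG HADOOP_TARBALL=hadoop.tar.gz
   ARG ZOOKEEPER_TARBALL=zookeeper.tar.gz

   # 1. Copy hadoop and ZK in. These files rarely change, so the layer stays cached.
   COPY ${HADOOP_TARBALL} ${ZOOKEEPER_TARBALL} /tmp/

   # 2. Install the JDK (assumed package name) and unpack hadoop and ZK.
   RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-11-jdk-headless \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /opt/hadoop /opt/zookeeper \
    && tar -xzf /tmp/${HADOOP_TARBALL} -C /opt/hadoop --strip-components=1 \
    && tar -xzf /tmp/${ZOOKEEPER_TARBALL} -C /opt/zookeeper --strip-components=1

   # Declared here rather than at the top, so that overriding this build
   # argument does not affect the cache for the steps above.
   ARG ACCUMULO_TARBALL=accumulo.tar.gz

   # 3. Copy accumulo in. This layer is rebuilt when the tarball's contents change.
   COPY ${ACCUMULO_TARBALL} /tmp/

   # 4. Install accumulo. This reruns whenever step 3 was rebuilt.
   RUN mkdir -p /opt/accumulo \
    && tar -xzf /tmp/${ACCUMULO_TARBALL} -C /opt/accumulo --strip-components=1
   ```

   Placing the accumulo `ARG` and `COPY` after the hadoop and ZK steps is the point of the ordering: a changed build argument or changed file contents should only invalidate the instructions that come after it.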



[GitHub] [accumulo-docker] brianloss commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

brianloss commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1250956637

I'm not sure you even need a multi-stage build. Typically, as I've seen anyway, you would do a multi-stage build if you need to install a lot of dependencies or need a different base image to build some software, but then just want to use the build output in your final image.
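
For reference, the multi-stage pattern described here usually has the following shape. This is a generic sketch, not something proposed for this repository: the stage name `builder`, the base images, the package list, the `src/` directory and its Makefile, and the `/opt/app` path are all assumptions.

```dockerfile
# Stage 1: build environment with a compiler toolchain (assumed packages).
FROM ubuntu:22.04 AS builder
RUN apt-get update \
 && apt-get install -y --no-install-recommends make g++ \
 && rm -rf /var/lib/apt/lists/*

# Hypothetical source tree in the build context.
COPY src/ /src/

# Hypothetical build step: assumes /src has a Makefile whose output lands in /src/out.
RUN make -C /src

# Stage 2: the final image starts clean and takes only the build output,
# so the compilers and intermediate files from stage 1 are left behind.
FROM ubuntu:22.04
COPY --from=builder /src/out/ /opt/app/
```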

Docker images are already layered, and the general advice is to minimize the number of commands in the Dockerfile that add layers, since extra layers make the image larger. However, combining everything into a small number of layers means that any change causes that one layer (or those few layers) to be rebuilt. You can see that with the [single RUN command](https://github.com/apache/accumulo-docker/blob/main/Dockerfile#L39) in the Dockerfile. Since the hadoop/zookeeper/accumulo download and accumulo native build are all combined into a single layer, if you want to rebuild accumulo then everything must be repeated. Simply splitting the RUN command into multiple run commands might achieve what you want. The earlier RUN commands should have the things that you don't expect to change often (e.g., download and install hadoop and zookeeper). Then the later RUN commands should have the things you want to iterate on.


[GitHub] [accumulo-docker] keith-turner closed pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

keith-turner closed pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.
URL: https://github.com/apache/accumulo-docker/pull/20



[GitHub] [accumulo-docker] keith-turner commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

keith-turner commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1250916545

   > Did you mean to target this against the next-release branch?
   
   Oh I forgot about that branch, yeah I probably want to target against it.
   
   > Also, did you look at Docker [multi-stage](https://docs.docker.com/develop/develop-images/multistage-build/#name-your-build-stages) builds?
   
   No I didn't, I will take a look at that.



[GitHub] [accumulo-docker] dlmarion commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

dlmarion commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1249758868

   Did you mean to target this against the `next-release` branch? Also, did you look at Docker [multi-stage](https://docs.docker.com/develop/develop-images/multistage-build/#name-your-build-stages) builds? You might only need to change the existing Dockerfile to achieve what you want.



[GitHub] [accumulo-docker] brianloss commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

brianloss commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1250972522

   Yes, my apologies--I should have been more specific. The order of the steps that create layers is critical, and RUN commands are not the only ones that create a layer. COPY and ADD commands do as well, so their location within the file matters.



[GitHub] [accumulo-docker] keith-turner commented on pull request #20: Make rebuilding the docker image for a new Accumulo snapshot faster.

Posted by GitBox <gi...@apache.org>.

keith-turner commented on PR #20:
URL: https://github.com/apache/accumulo-docker/pull/20#issuecomment-1251252758

   replaced by #22 

