Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/12/02 00:53:13 UTC

[GitHub] [beam] tvalentyn commented on pull request #13399: [BEAM-11312] Log cloud build url and enable kaniko cache in sdk_conta…

tvalentyn commented on pull request #13399:
URL: https://github.com/apache/beam/pull/13399#issuecomment-736915744


   > > Thanks. It seems that caching may improve the startup time and be useful for users who frequently launch the same pipeline. However, I think caching may result in a difference in behavior. Questions:
   > > 
   > > 1. Is it possible that caching will result in a stale image that users will perceive as undesirable, and that the behavior will be difficult to debug for users or support folks? For example, suppose a user pipeline depends on the latest version of a dependency X in PyPI, perhaps a dependency they control. They have a pipeline with a setup.py that has an open install_requires bound `dep>=1.0.0,<2`. They run the pipeline, then push a new version of the dependency to PyPI and run the pipeline again, expecting a change in behavior. Kaniko will not rebuild the image in this case, right? What are your thoughts on that?
   > 
   > I think the Kaniko cache works the same way as the Docker layer cache: if the locally downloaded artifacts change (or requirements.txt / setup.py change), the COPY step in the prebuilding workflow actually changes. From the artifact copy step onward there is no valid cache layer, so a new image will be rebuilt. (I also verified this through my own experiment with changing requirements.txt.)
   
   Thanks for checking. That sounds similar to the Docker build cache: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#leverage-build-cache
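   
   To make the invalidation behavior concrete, here is a toy sketch (not Kaniko's actual implementation) of how a content-addressed layer cache key can be derived; any change to requirements.txt or setup.py changes the key of the COPY layer and of every layer built after it:
   
   ```python
   import hashlib
   
   def layer_cache_key(parent_key, command, copied_files):
       """Toy model of a layer cache key: it covers the parent layer's key,
       the build instruction, and the content of any files the step copies in."""
       digest = hashlib.sha256(parent_key.encode())
       digest.update(command.encode())
       for path in copied_files:
           with open(path, 'rb') as f:
               digest.update(f.read())
       return digest.hexdigest()
   
   # Editing requirements.txt changes the key of the COPY layer, which in turn
   # changes the key of every later layer (e.g. the `pip install` step), so
   # those layers are rebuilt instead of being served from the cache.
   ```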
   
   Transitive dependencies not present in requirements.txt may not be updated, but it would be better to list them in requirements.txt anyway to avoid pickling mismatches on the worker, as mentioned in https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/.
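   
   As an illustration, using the standard `--requirements_file` setup option to stage an explicit dependency list keeps the prebuilt worker image aligned with the submission environment:
   
   ```python
   import apache_beam as beam
   from apache_beam.options.pipeline_options import PipelineOptions
   
   # Pin direct and transitive dependencies (e.g. captured with `pip freeze`)
   # in requirements.txt and stage the file explicitly, so the worker image is
   # built with the same versions used when the pipeline was constructed.
   options = PipelineOptions(['--requirements_file=requirements.txt'])
   with beam.Pipeline(options=options) as pipeline:
       _ = pipeline | beam.Create(['hello']) | beam.Map(print)
   ```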
   
   > 
   > > 1. During runtime with the prebuilding workflow enabled, how visible is it to the user that the cached layers are reused and not rebuilt?
   > 
   > There will be log entries like "No cached layer found for cmd ..." in the Cloud Build log.
   
   Would it be possible to mention that caching is used when it is, rather than when it is not? Or perhaps add a generic info message along the lines of: `Staging pipeline dependencies into a prebuilt container image. To optimize build time, build steps will be cached.`
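   
   A minimal sketch of such a message, assuming a hypothetical helper inside the prebuilding workflow:
   
   ```python
   import logging
   
   _LOGGER = logging.getLogger(__name__)
   
   def _log_prebuild_start():
       # Hypothetical helper; the exact call site in the prebuilding workflow
       # is not shown here.
       _LOGGER.info(
           'Staging pipeline dependencies into a prebuilt container image. '
           'To optimize build time, build steps will be cached.')
   ```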
   
   Also, do you know if a user can tell Kaniko to clean the cache manually?
   
   > 
   > > 1. I think we should document the prebuilding feature in the Beam docs, and reflect the caching behavior and associated TTLs. What is the plan for that?
   > 
   > I believe Emily will be working on documenting this as part of the custom container documentation next quarter, and I can also help.
   > 
   > > 1. Would customizing the TTL or adding a no-cache option make sense? We are using the default 2-week TTL, right? See: [cloud.google.com/cloud-build/docs/kaniko-cache#configuring_the_cache_expiration_time](https://cloud.google.com/cloud-build/docs/kaniko-cache#configuring_the_cache_expiration_time).
   > 
   > I think the default value makes sense. I didn't want to provide too many knobs to users, since they may become confusing or rarely used, but we can always provide additional flags for more advanced users to control it.
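   
   For reference, a rough sketch (using a hypothetical option name, not an existing Beam flag) of how such a knob could be mapped onto kaniko's `--cache` / `--cache-ttl` arguments:
   
   ```python
   def kaniko_cache_args(cache_enabled=True, cache_ttl_hours=None):
       """Translate hypothetical user-facing options into kaniko executor flags."""
       args = ['--cache=true' if cache_enabled else '--cache=false']
       if cache_ttl_hours is not None:
           # kaniko expects a Go-style duration; 336h corresponds to the
           # two-week default described in the Cloud Build docs.
           args.append('--cache-ttl={}h'.format(cache_ttl_hours))
       return args
   
   # kaniko_cache_args(cache_ttl_hours=24) -> ['--cache=true', '--cache-ttl=24h']
   ```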
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org