You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by "robertwb (via GitHub)" <gi...@apache.org> on 2023/09/07 22:47:24 UTC

[GitHub] [beam] robertwb commented on a diff in pull request #28335: Add script to cache provider artifacts for faster startup.

robertwb commented on code in PR #28335:
URL: https://github.com/apache/beam/pull/28335#discussion_r1319184302


##########
sdks/python/apache_beam/yaml/yaml_provider.py:
##########
@@ -473,23 +490,57 @@ def __init__(self, packages, base_python=sys.executable):
     self._packages = packages
     self._base_python = base_python
 
-  def _key(self):
-    return json.dumps({'binary': self._base_python, 'packages': self._packages})
+  @classmethod
+  def _key(cls, base_python, packages):
+    return json.dumps({
+        'binary': base_python, 'packages': sorted(packages)
+    },
+                      sort_keys=True)
 
-  def _venv(self):
-    venv = os.path.join(
-        self.VENV_CACHE,
-        hashlib.sha256(self._key().encode('utf-8')).hexdigest())
+  @classmethod
+  def _path(cls, base_python, packages):
+    return os.path.join(
+        cls.VENV_CACHE,
+        hashlib.sha256(cls._key(base_python,
+                                packages).encode('utf-8')).hexdigest())
+
+  @classmethod
+  def _create_venv_from_scratch(cls, base_python, packages):
+    venv = cls._path(base_python, packages)
     if not os.path.exists(venv):
-      python_binary = os.path.join(venv, 'bin', 'python')
-      subprocess.run([self._base_python, '-m', 'venv', venv], check=True)
-      subprocess.run([python_binary, '-m', 'ensurepip'], check=True)
-      subprocess.run([python_binary, '-m', 'pip', 'install'] + self._packages,
+      subprocess.run([base_python, '-m', 'venv', venv], check=True)
+      venv_python = os.path.join(venv, 'bin', 'python')
+      subprocess.run([venv_python, '-m', 'ensurepip'], check=True)
+      subprocess.run([venv_python, '-m', 'pip', 'install'] + packages,
                      check=True)
       with open(venv + '-requirements.txt', 'w') as fout:
-        fout.write('\n'.join(self._packages))
+        fout.write('\n'.join(packages))
     return venv

Review Comment:
   Yeah, installing apache beam is slow. I just filed https://github.com/apache/beam/issues/28364 about this. However, the change here lets you install it once and then clone this environment for all other created venvs, so the price is paid once at docker build time and then the expansion service (e.g. during yaml submission time) is fast(er). 
   
   Installing this from the local environment is an interesting idea, but might be hard, might not match the container environment, and might not be clean. 
   
   A downstream dependency can depend on `apache_beam[X]` which should only install the delta, but I added a couple more here to be complete. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org