You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@airflow.apache.org by "uranusjr (via GitHub)" <gi...@apache.org> on 2023/02/20 05:52:41 UTC

[GitHub] [airflow] uranusjr opened a new pull request, #29625: Aggressively cache entry points in process

uranusjr opened a new pull request, #29625:
URL: https://github.com/apache/airflow/pull/29625

   `importlib.metadata.distributions()` reads information from the actual installations, which is a lot of IO that we can avoid by caching.
   
   The benefit of this depends on how many packages you have in your installation. It’s nearly zero with a bare Airflow installation, and I observed a ~7% save (17s to 16s) for the webserver to finish init (launch until `when_ready` is emitted) in a setup with all official providers installed.
   
   The downside is we are now persisting a lot of small objects in memory. I wonder whether there’s a good time we can purge those.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr merged pull request #29625: Aggressively cache entry points in process

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr merged PR #29625:
URL: https://github.com/apache/airflow/pull/29625


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] ashb commented on a diff in pull request #29625: Aggressively cache entry points in process

Posted by "ashb (via GitHub)" <gi...@apache.org>.
ashb commented on code in PR #29625:
URL: https://github.com/apache/airflow/pull/29625#discussion_r1111824731


##########
airflow/utils/entry_points.py:
##########
@@ -28,26 +30,33 @@
 
 log = logging.getLogger(__name__)
 
+EPnD = Tuple[metadata.EntryPoint, metadata.Distribution]
 
-def entry_points_with_dist(group: str) -> Iterator[tuple[metadata.EntryPoint, metadata.Distribution]]:
-    """Retrieve entry points of the given group.
-
-    This is like the ``entry_points()`` function from importlib.metadata,
-    except it also returns the distribution the entry_point was loaded from.
 
-    :param group: Filter results to only this entrypoint group
-    :return: Generator of (EntryPoint, Distribution) objects for the specified groups
-    """
+@functools.lru_cache(maxsize=None)
+def _get_grouped_entry_points() -> dict[str, list[EPnD]]:
     loaded: set[str] = set()
+    mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
     for dist in metadata.distributions():
         try:
             key = canonicalize_name(dist.metadata["Name"])

Review Comment:
   I was profling memory usage (not speed), and this seemed to be a cause of a lot of bloat -- and it's suprisingly expensive to get the dist name (involves parsing a lot of files for _every_ dist in airflow).
   
   I think there are three options here:
   
   1. Only compute this key if the entrypoint group matches (this limits the expensive operation to just dists we actually care about, instead of _all_
   2. Use the path component to be the cache key (see below)
   3. Cache based on the entrypoint classpath
   4. Don't cache at all. This only catches a case when you have multiple copies of the same dist in multiple cases (which is _rare_ to hit outside of being an Airflow developer anyway).
   
   On point 2, it doesn't seem possible to do this using public methods, but this:
   
   ```python console
   In [14]: d._path
   Out[14]: PosixPath('/home/ash/airflow/.venv/lib/python3.11/site-packages/greenlet-2.0.2.dist-info')
   
   In [15]: d._path.stem
   Out[15]: 'greenlet-2.0.2'
   ```
   
   Doing this might break for less common ways of shipping dists though, so it's likely not a good option, not without a fallback anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [airflow] uranusjr commented on a diff in pull request #29625: Aggressively cache entry points in process

Posted by "uranusjr (via GitHub)" <gi...@apache.org>.
uranusjr commented on code in PR #29625:
URL: https://github.com/apache/airflow/pull/29625#discussion_r1112560468


##########
airflow/utils/entry_points.py:
##########
@@ -28,26 +30,33 @@
 
 log = logging.getLogger(__name__)
 
+EPnD = Tuple[metadata.EntryPoint, metadata.Distribution]
 
-def entry_points_with_dist(group: str) -> Iterator[tuple[metadata.EntryPoint, metadata.Distribution]]:
-    """Retrieve entry points of the given group.
-
-    This is like the ``entry_points()`` function from importlib.metadata,
-    except it also returns the distribution the entry_point was loaded from.
 
-    :param group: Filter results to only this entrypoint group
-    :return: Generator of (EntryPoint, Distribution) objects for the specified groups
-    """
+@functools.lru_cache(maxsize=None)
+def _get_grouped_entry_points() -> dict[str, list[EPnD]]:
     loaded: set[str] = set()
+    mapping: dict[str, list[EPnD]] = collections.defaultdict(list)
     for dist in metadata.distributions():
         try:
             key = canonicalize_name(dist.metadata["Name"])

Review Comment:
   I checked the callers and currently this is used by loading `airflow.plugins` and `apache_airflow_provider`. Both of these implement their own deduplication logic, so I think it is safe to remove this entirely. Although this would not actually help the latter case, which still accesses `metadata` anyway…



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@airflow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org