Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2021/07/13 07:07:28 UTC

[GitHub] [superset] ktmud commented on a change in pull request #15648: perf: Refactor Dashboard.datasets_trimmed_for_slices et al.

ktmud commented on a change in pull request #15648:
URL: https://github.com/apache/superset/pull/15648#discussion_r668470726



##########
File path: superset/models/dashboard.py
##########
@@ -246,13 +256,24 @@ def data(self) -> Dict[str, Any]:
         unless=lambda: not is_feature_enabled("DASHBOARD_CACHE"),
     )
     def datasets_trimmed_for_slices(self) -> List[Dict[str, Any]]:
-        datasource_slices = utils.indexed(self.slices, "datasource")
-        return [
-            # Filter out unneeded fields from the datasource payload
-            datasource.data_for_slices(slices)
-            for datasource, slices in datasource_slices.items()
-            if datasource
-        ]
+        # Verbose but efficient database enumeration of dashboard datasources.
+        datasource_slices: Dict[
+            Tuple[Type["BaseDatasource"], int], Set[Slice]
+        ] = defaultdict(set)
+
+        for slc in self.slices:
+            datasource_slices[(slc.cls_model, slc.datasource_id)].add(slc)
+
+        result: List[Dict[str, Any]] = []
+
+        for (cls_model, datasource_id), slices in datasource_slices.items():
+            datasource = db.session.query(cls_model).filter_by(id=datasource_id).first()
+
+            if datasource:
+                # Filter out unneeded fields from the datasource payload
+                result.append(datasource.data_for_slices(slices))
+
+        return result

Review comment:
      For this method, though, I believe the real bottleneck is in `datasource.data_for_slices`, where [these two lines](https://github.com/airbnb/superset-fork/blob/b295c6ad43798d33b15af7b34ab4a4ff1fd2a903/superset/connectors/base/models.py#L263-L264) take a lot of time.
   
   Generating the `datasource.data` dictionary for the one big datasource we have takes about 1.2s (80% of the query time). There are two possible solutions here:
   
   1. Refactor `datasource.data` and `datasource.data_for_slices(...)` to load metrics & columns lazily.
   2. Cache `Datasource.data` and clear the cache with a SQLA hook, in a similar fashion to [clear_dashboard_cache](https://github.com/airbnb/superset-fork/blob/f24264ccdc6bf6ab95f1969295dd9ba3612e9054/superset/models/dashboard.py#L405) (see the sketch below).
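
   A minimal sketch of option 2, using Flask-Caching's `memoize`/`delete_memoized` plus a SQLAlchemy `after_update` hook on a stand-in model. All names below are illustrative, not Superset's actual implementation:

   ```python
   import sqlalchemy as sqla
   from flask import Flask
   from flask_caching import Cache
   from sqlalchemy import event
   from sqlalchemy.ext.declarative import declarative_base

   app = Flask(__name__)
   cache = Cache(app, config={"CACHE_TYPE": "SimpleCache"})
   Base = declarative_base()


   class Datasource(Base):  # stand-in for BaseDatasource / SqlaTable
       __tablename__ = "datasources"
       id = sqla.Column(sqla.Integer, primary_key=True)
       name = sqla.Column(sqla.String(250))

       def __repr__(self) -> str:
           # Stable repr so Flask-Caching keys the memoized method per row.
           return f"Datasource<{self.id}>"

       @cache.memoize(timeout=600)
       def data_payload(self) -> dict:
           # Stand-in for the expensive ``datasource.data`` serialization
           # of every column and metric.
           return {"id": self.id, "name": self.name}


   @event.listens_for(Datasource, "after_update")
   def clear_datasource_cache(_mapper, _connection, target: Datasource) -> None:
       # Drop the memoized payload whenever the row changes, in the same
       # spirit as ``clear_dashboard_cache``.
       cache.delete_memoized(Datasource.data_payload, target)
   ```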
   
   

##########
File path: superset/models/dashboard.py
##########
@@ -171,8 +172,17 @@ def url(self) -> str:
 
     @property
     def datasources(self) -> Set[BaseDatasource]:
-        # pylint: disable=using-constant-test
-        return {slc.datasource for slc in self.slices if slc.datasource}
+        # Verbose but efficient database enumeration of dashboard datasources.
+        datasources = {(slc.cls_model, slc.datasource_id) for slc in self.slices}
+        result: Set[BaseDatasource] = set()
+
+        for cls_model, datasource_id in datasources:
+            datasource = db.session.query(cls_model).filter_by(id=datasource_id).first()
+
+            if datasource:
+                result.add(datasource)
+
+        return result

Review comment:
      Here's another way (probably more efficient):
   
   ```python
       @property
       def datasources(self) -> Set[BaseDatasource]:
           datasources = defaultdict(set)
           for slc in self.slices:
               datasources[slc.cls_model].add(slc.datasource_id)
           result = []
           for cls_model, ids in datasources.items():
            result += db.session.query(cls_model).filter(cls_model.id.in_(ids)).all()
           return set(result)
   ```
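
   (The speedup presumably comes from issuing a single `SELECT ... WHERE id IN (...)` per datasource model class, instead of a separate lazy-load query per slice.)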
   
   On one of my test dashboards, this returns an average query time of `0.028s`.
   <img src="https://user-images.githubusercontent.com/335541/125401010-750f3680-e367-11eb-9c1a-d4ff680ba986.png" width="300" />
   
   
   While the original method:
   
   ```python
       @property
       def datasources(self) -> Set[BaseDatasource]:
           # pylint: disable=using-constant-test
           return {slc.datasource for slc in self.slices if slc.datasource}
   ```
   
   costs about `0.5s` (17x slower).
   
   <img src="https://user-images.githubusercontent.com/335541/125400609-f4503a80-e366-11eb-9779-f4feee779f84.png" width="200">
   
   The code in this PR is about 2x slower than my optimized version:
   <img src="https://user-images.githubusercontent.com/335541/125401154-a7209880-e367-11eb-922f-006f310dfac3.png" width="200">
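
   FWIW, the same batching could be applied to `datasets_trimmed_for_slices` as well. A sketch against the PR's code (untested):

   ```python
       def datasets_trimmed_for_slices(self) -> List[Dict[str, Any]]:
           # Group slices by (model class, datasource id) as in the PR...
           datasource_slices: Dict[
               Tuple[Type["BaseDatasource"], int], Set[Slice]
           ] = defaultdict(set)
           for slc in self.slices:
               datasource_slices[(slc.cls_model, slc.datasource_id)].add(slc)

           # ...then fetch all datasources of each model class in one query.
           ids_by_model: Dict[Type["BaseDatasource"], Set[int]] = defaultdict(set)
           for cls_model, datasource_id in datasource_slices:
               ids_by_model[cls_model].add(datasource_id)

           result: List[Dict[str, Any]] = []
           for cls_model, ids in ids_by_model.items():
               for datasource in db.session.query(cls_model).filter(
                   cls_model.id.in_(ids)
               ):
                   # Filter out unneeded fields from the datasource payload
                   result.append(
                       datasource.data_for_slices(
                           datasource_slices[(cls_model, datasource.id)]
                       )
                   )
           return result
   ```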
   
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


