You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@superset.apache.org by GitBox <gi...@apache.org> on 2021/06/23 09:45:13 UTC
[GitHub] [superset] villebro commented on a change in pull request #15279: feat: run extra query on QueryObject and add compare operator for post_processing

villebro commented on a change in pull request #15279:
URL: https://github.com/apache/superset/pull/15279#discussion_r656877011



##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -425,9 +429,11 @@ def select(
 
 
 @validate_column_args("columns")
-def diff(df: DataFrame, columns: Dict[str, str], periods: int = 1,) -> DataFrame:
+def diff(
+    df: DataFrame, columns: Dict[str, str], periods: int = 1, axis: int = 0,

Review comment:
       Let's be more explicit about the axis type (I personally never remember for certain what 0 and 1 means on axis). If Pandas isn't yet exposing types, maybe something like
   
   ```python
   class PandasPostprocessingAxis(int, Enum):
       ROW = 0
       COLUMN = 1
   ```

##########
File path: superset/common/query_context.py
##########
@@ -97,6 +101,62 @@ def __init__(  # pylint: disable=too-many-arguments
             "result_format": self.result_format,
         }
 
+    def processing_time_offset(
+        self, df: pd.DataFrame, query_object: QueryObject,
+    ) -> Tuple[pd.DataFrame, List[str]]:
+        # ensure query_object is immutable
+        query_object_clone = copy.copy(query_object)
+        rv_sql = []
+
+        time_offset = query_object.time_offset
+        outer_from_dttm = query_object.from_dttm
+        outer_to_dttm = query_object.to_dttm
+        for offset in time_offset:
+            try:
+                query_object_clone.from_dttm = get_past_or_future(
+                    offset, outer_from_dttm,
+                )
+                query_object_clone.to_dttm = get_past_or_future(offset, outer_to_dttm,)
+            except ValueError as ex:
+                raise QueryObjectValidationError(str(ex))
+            # make sure subquery use main query where clause
+            query_object_clone.inner_from_dttm = outer_from_dttm
+            query_object_clone.inner_to_dttm = outer_to_dttm
+            query_object_clone.time_offset = []

Review comment:
       I wonder if we should add `time_offset` to the `QueryObject` schema and rename the current one to `time_offsets`, adding both to the cache key. Example:
   
   We want to make a query with two offsets: one year ago and two years ago.
   
   The "actual" main query that gets executed and cached (no additional columns added yet):
   ```python
   time_offsets: None
   time_offset: None
   ```
   
   First extra query (gets concatenated to the previous dataframe):
   ```python
   time_offsets: None
   time_offset: 1
   ```
   
   Second extra query (also concatenated to the main dataframe):
   ```python
   time_offsets: None
   time_offset: 2
   ```
   
   
   Main `QueryObject`:
   ```python
   time_offsets: [1, 2]
   time_offset: None
   ```
   
   Finally, when the full query object is constructed, the following result would be cached with the following keys:
   
   This way the main query result would be persisted along with the results of the extra query results without the need to rebuild the full dataframe on each request, and the extra queries could then also be cached individually.

##########
File path: superset/utils/pandas_postprocessing.py
##########
@@ -436,14 +442,68 @@ def diff(df: DataFrame, columns: Dict[str, str], periods: int = 1,) -> DataFrame
            on diff values calculated from `y`, leaving the original column `y`
            unchanged.
     :param periods: periods to shift for calculating difference.
+    :param axis: 0 for row, 1 for column. default 0.
     :return: DataFrame with diffed columns
     :raises QueryObjectValidationError: If the request in incorrect
     """
     df_diff = df[columns.keys()]
-    df_diff = df_diff.diff(periods=periods)
+    df_diff = df_diff.diff(periods=periods, axis=axis)
     return _append_columns(df, df_diff, columns)
 
 
+@validate_column_args("source_columns", "compare_columns")
+def compare(
+    df: DataFrame,
+    source_columns: List[str],
+    compare_columns: List[str],
+    compare_type: Optional[str],

Review comment:
       Same here: `class PandasPostprocessingCompare(str, Enum)`

##########
File path: superset/common/query_context.py
##########
@@ -130,11 +191,16 @@ def get_query_result(self, query_object: QueryObject) -> Dict[str, Any]:
             if self.enforce_numerical_metrics:
                 self.df_metrics_to_num(df, query_object)
 
+            if query_object.time_offset:
+                df, offset_sql = self.processing_time_offset(df, query_object)
+                query += ";\n\n".join(offset_sql)
+                query += ";\n\n"

Review comment:
       We probably don't need this line?
   ```suggestion
   ```

##########
File path: superset/common/query_object.py
##########
@@ -94,6 +96,7 @@ class QueryObject:
     datasource: Optional[BaseDatasource]
     result_type: Optional[ChartDataResultType]
     is_rowcount: bool
+    time_offset: List[str]

Review comment:
       I think it would be great to deprecate the current string-based offset, and use a periods-based approach, much like in rolling window, using the time grain as the time offset unit.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscribe@superset.apache.org
For additional commands, e-mail: notifications-help@superset.apache.org