You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/08/14 18:14:35 UTC

[GitHub] [incubator-pinot] jasonyanwenl opened a new pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

jasonyanwenl opened a new pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866


   Phase 2 is separated into two parts. This PR is for the 1st part. The main change is that a new table is created for storing online ad-hoc data. The PR for Phase 1 can be found in #5769.
   
   Here are some other minor changes:
   
   * Remove the `cleanExistingState` method in the `AnomalyDetectionResource`. This is not necessary because each request is distinct now and the monitor job will periodically clean them (dataset/metric/detection configs, ad-hoc data).
   * Remove the suffix appended to metric as the metric name is allowed to be duplicate.
   
   The 2nd part of phase 2 is about creating a new endpoint to provide registering and CRUD operations on ad-hoc data.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jasonyanwenl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
jasonyanwenl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r473093021



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -464,6 +420,38 @@ TaskDTO generateTaskConfig(DetectionConfigDTO detectionConfigDTO,
     return taskDTO;
   }
 
+  long saveOnlineDetectionData(JsonNode payloadNode,
+      DatasetConfigDTO datasetConfigDTO, MetricConfigDTO metricConfigDTO)
+        throws JsonProcessingException {
+    JsonNode dataNode = payloadNode.get(DATA_FIELD);
+    String timeColumnName = datasetConfigDTO.getTimeColumn();
+    String datasetName = datasetConfigDTO.getDataset();
+    String metricName = metricConfigDTO.getName();
+
+    // Check if time & metric columns exist in adhoc data
+    ArrayNode columnsNode = dataNode.withArray(COLUMNS_FIELD);
+    int[] colIndices = findTimeAndMetricColumns(columnsNode,
+        timeColumnName, metricName);
+    int timeColIdx = colIndices[0];
+    int metricColIdx = colIndices[1];
+    Preconditions.checkArgument(metricColIdx>=0 && timeColIdx>=0,
+        String.format("metric: %s or time: %s not found in adhoc data.",
+            metricName, timeColumnName));
+
+    // Save online data
+    OnlineDetectionDataDTO onlineDetectionDataDTO = new OnlineDetectionDataDTO();
+    onlineDetectionDataDTO.setDataset(datasetName);
+    onlineDetectionDataDTO.setMetric(metricName);
+    onlineDetectionDataDTO.setOnlineDetectionData(this.objectMapper.writeValueAsString(dataNode));

Review comment:
       Thanks! I have already limited the size of request data. But originally I found the column type is `MEDIUMTEXT` in our internal database. So I just limit the size to be 10MB. However, here in our open-source SQL script, it is `TEXT` so the maximum size is 64 KB like what you said. Anyway, I will keep consistent with our open-source version and limit the request data size to be less than 64KB.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] vincentchenjl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
vincentchenjl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r472489939



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -345,18 +323,15 @@ DatasetConfigDTO generateDatasetConfig(JsonNode payloadNode, String suffix) {
     return datasetConfigDTO;
   }
 
-  MetricConfigDTO generateMetricConfig(JsonNode payloadNode, String suffix,
-        DatasetConfigDTO datasetConfigDTO)
-      throws JsonProcessingException {
+  MetricConfigDTO generateMetricConfig(JsonNode payloadNode, String suffix) {
     MetricConfigDTO metricConfigDTO = new MetricConfigDTO();
-    JsonNode dataNode = payloadNode.get(DATA_FIELD);
 
     // Default configuration
-    metricConfigDTO.setName(DEFAULT_METRIC_NAME + suffix);
+    metricConfigDTO.setName(DEFAULT_METRIC_NAME);

Review comment:
       Any reason we need to hardcode this here?

##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -511,24 +499,19 @@ private TaskDTO pollingTask(long taskId) {
   }
 
   private void cleanStates(MetricConfigDTO metricConfigDTO, DatasetConfigDTO datasetConfigDTO) {
+    // Clean up ad hoc data
     if (datasetConfigDTO != null) {
-      datasetConfigDAO.delete(datasetConfigDTO);
-      LOG.info("Deleted dataset: {}", datasetConfigDTO);
-
-      int anomalyCnt = anomalyDAO.deleteByPredicate(
-          Predicate.EQ("collection", datasetConfigDTO.getName()));
-      LOG.info("Deleted {} anomalies with dataset {}",
-          anomalyCnt, datasetConfigDTO.getName());
+      int onlineDetectionDataCnt = onlineDetectionDataDAO
+          .deleteByPredicate(Predicate.EQ("dataset", datasetConfigDTO.getName()));

Review comment:
       If we store the ad-hoc data, do we still want store the metadata in normalized form? Since we clean up the dataset every time, we can store all the metadata into the ad hoc data table.

##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -464,6 +420,38 @@ TaskDTO generateTaskConfig(DetectionConfigDTO detectionConfigDTO,
     return taskDTO;
   }
 
+  long saveOnlineDetectionData(JsonNode payloadNode,
+      DatasetConfigDTO datasetConfigDTO, MetricConfigDTO metricConfigDTO)
+        throws JsonProcessingException {
+    JsonNode dataNode = payloadNode.get(DATA_FIELD);
+    String timeColumnName = datasetConfigDTO.getTimeColumn();
+    String datasetName = datasetConfigDTO.getDataset();
+    String metricName = metricConfigDTO.getName();
+
+    // Check if time & metric columns exist in adhoc data
+    ArrayNode columnsNode = dataNode.withArray(COLUMNS_FIELD);
+    int[] colIndices = findTimeAndMetricColumns(columnsNode,
+        timeColumnName, metricName);
+    int timeColIdx = colIndices[0];
+    int metricColIdx = colIndices[1];
+    Preconditions.checkArgument(metricColIdx>=0 && timeColIdx>=0,
+        String.format("metric: %s or time: %s not found in adhoc data.",
+            metricName, timeColumnName));
+
+    // Save online data
+    OnlineDetectionDataDTO onlineDetectionDataDTO = new OnlineDetectionDataDTO();
+    onlineDetectionDataDTO.setDataset(datasetName);
+    onlineDetectionDataDTO.setMetric(metricName);
+    onlineDetectionDataDTO.setOnlineDetectionData(this.objectMapper.writeValueAsString(dataNode));

Review comment:
       Since we are storing the time series data in JSON, we might exceed the limit of MySQL text field, which is 64KB. We should at least make sure that the field should not be overflowed the large data users posted.

##########
File path: thirdeye/thirdeye-pinot/src/main/resources/schema/create-schema.sql
##########
@@ -437,3 +437,15 @@ create index rootcause_template_id_idx ON rootcause_template_index(base_id);
 create index rootcause_template_owner_idx ON rootcause_template_index(owner);
 create index rootcause_template_metric_idx on rootcause_template_index(metric_id);
 create index rootcause_template_config_application_idx ON rootcause_template_index(`application`);
+
+create table if not exists online_detection_data_index (
+    base_id bigint(20) not null,
+    dataset varchar(200),
+    metric varchar(200),
+    create_time timestamp default 0,
+    update_time timestamp default current_timestamp,
+    version int(10)
+) ENGINE=InnoDB;

Review comment:
       For long term, we might want to consider using BLOB/CLOB to storing ad hoc data so that we can support dataset larger than 64KB, not to mention that the data is stored in JSON format, which is really inefficient ways to store data, compare to any binary format. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jasonyanwenl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
jasonyanwenl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r473202402



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -511,24 +499,19 @@ private TaskDTO pollingTask(long taskId) {
   }
 
   private void cleanStates(MetricConfigDTO metricConfigDTO, DatasetConfigDTO datasetConfigDTO) {
+    // Clean up ad hoc data
     if (datasetConfigDTO != null) {
-      datasetConfigDAO.delete(datasetConfigDTO);
-      LOG.info("Deleted dataset: {}", datasetConfigDTO);
-
-      int anomalyCnt = anomalyDAO.deleteByPredicate(
-          Predicate.EQ("collection", datasetConfigDTO.getName()));
-      LOG.info("Deleted {} anomalies with dataset {}",
-          anomalyCnt, datasetConfigDTO.getName());
+      int onlineDetectionDataCnt = onlineDetectionDataDAO
+          .deleteByPredicate(Predicate.EQ("dataset", datasetConfigDTO.getName()));

Review comment:
       Thank you for the point! This is a good idea. But it also has some trade-off. Like during the detection pipeline workflow, we cannot reuse the existing DAO layer logic to retrieve dataset/metric configurations from corresponding tables. Probably we will add some extra logic to separate online detection from the original detection during the DAO layer for this. Currently how about let's currently keep this and I will note down this in the design doc for future reference?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jasonyanwenl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
jasonyanwenl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r473085145



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -345,18 +323,15 @@ DatasetConfigDTO generateDatasetConfig(JsonNode payloadNode, String suffix) {
     return datasetConfigDTO;
   }
 
-  MetricConfigDTO generateMetricConfig(JsonNode payloadNode, String suffix,
-        DatasetConfigDTO datasetConfigDTO)
-      throws JsonProcessingException {
+  MetricConfigDTO generateMetricConfig(JsonNode payloadNode, String suffix) {
     MetricConfigDTO metricConfigDTO = new MetricConfigDTO();
-    JsonNode dataNode = payloadNode.get(DATA_FIELD);
 
     // Default configuration
-    metricConfigDTO.setName(DEFAULT_METRIC_NAME + suffix);
+    metricConfigDTO.setName(DEFAULT_METRIC_NAME);

Review comment:
       This is the default name of the metric. We firstly set the metric name to be the default one. Then we check if the user provides a metric configuration. If so, we will use the customized metric name given by the user.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jasonyanwenl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
jasonyanwenl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r473375909



##########
File path: thirdeye/thirdeye-pinot/src/main/resources/schema/create-schema.sql
##########
@@ -437,3 +437,15 @@ create index rootcause_template_id_idx ON rootcause_template_index(base_id);
 create index rootcause_template_owner_idx ON rootcause_template_index(owner);
 create index rootcause_template_metric_idx on rootcause_template_index(metric_id);
 create index rootcause_template_config_application_idx ON rootcause_template_index(`application`);
+
+create table if not exists online_detection_data_index (
+    base_id bigint(20) not null,
+    dataset varchar(200),
+    metric varchar(200),
+    create_time timestamp default 0,
+    update_time timestamp default current_timestamp,
+    version int(10)
+) ENGINE=InnoDB;

Review comment:
       Thanks for this! I agree with you. In long term, we should use a more efficient way to store the data. This is just current workaround. And if we use a type to store it such as the BLOB, we cannot store it into the `generic_json_entity` table and need to consider where to store it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] harleyjj commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
harleyjj commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r470809557



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -464,6 +420,38 @@ TaskDTO generateTaskConfig(DetectionConfigDTO detectionConfigDTO,
     return taskDTO;
   }
 
+  long saveOnlineDetectionData(JsonNode payloadNode,
+      DatasetConfigDTO datasetConfigDTO, MetricConfigDTO metricConfigDTO)
+        throws JsonProcessingException {
+    JsonNode dataNode = payloadNode.get(DATA_FIELD);
+    String timeColumnName = datasetConfigDTO.getTimeColumn();
+    String datasetName = datasetConfigDTO.getDataset();
+    String metricName = metricConfigDTO.getName();
+

Review comment:
       @jasonyanwenl This code looks good.  Are there docs that tell users the formatting for adhoc data, or is that future work for Thirdeye?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] jasonyanwenl commented on a change in pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
jasonyanwenl commented on a change in pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866#discussion_r470827343



##########
File path: thirdeye/thirdeye-pinot/src/main/java/org/apache/pinot/thirdeye/api/detection/AnomalyDetectionResource.java
##########
@@ -464,6 +420,38 @@ TaskDTO generateTaskConfig(DetectionConfigDTO detectionConfigDTO,
     return taskDTO;
   }
 
+  long saveOnlineDetectionData(JsonNode payloadNode,
+      DatasetConfigDTO datasetConfigDTO, MetricConfigDTO metricConfigDTO)
+        throws JsonProcessingException {
+    JsonNode dataNode = payloadNode.get(DATA_FIELD);
+    String timeColumnName = datasetConfigDTO.getTimeColumn();
+    String datasetName = datasetConfigDTO.getDataset();
+    String metricName = metricConfigDTO.getName();
+

Review comment:
       Thanks Harley! Yeah, the formatting for the ad-hoc data will be explained in the API doc. I am currently writing that. I will share it when I finished it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] akshayrai merged pull request #5866: [TE] add anomaly detection as a service - new table for storing ad-hoc data

Posted by GitBox <gi...@apache.org>.
akshayrai merged pull request #5866:
URL: https://github.com/apache/incubator-pinot/pull/5866


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org