Posted to commits@druid.apache.org by "paul-rogers (via GitHub)" <gi...@apache.org> on 2023/02/17 18:37:16 UTC

[GitHub] [druid] paul-rogers commented on a diff in pull request #13815: Improved error message when topic name changes within same supervisor

paul-rogers commented on code in PR #13815:
URL: https://github.com/apache/druid/pull/13815#discussion_r1110173527


##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -191,7 +191,8 @@ public boolean hasNext()
               {
                 final long thisTimeoutNanos = timeoutAtNanos - System.nanoTime();
                 if (hasTimeout && thisTimeoutNanos < 0) {
-                  throw new QueryTimeoutException("Sequence iterator timed out");
+                  throw new QueryTimeoutException("Query did not complete within configured timeout period. " +
+                      "You can increase query timeout or tune the performance of query");

Review Comment:
   Suggestion:
   
   > Query did not complete within the timeout period set by druid.some.config.
   
   The action suggested here is OK, but it does not tell the user how to change the config. Also, there could be other causes: network issues, an overloaded machine, too many threads, and so on.
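   
   For concreteness, a minimal sketch of a message that names the knobs. This is a sketch only: I'm assuming the `timeout` query context parameter and `druid.server.http.defaultQueryTimeout` are the settings that govern this timeout, which should be verified against the code path.
   
   ```java
   // Sketch only: the config names below are assumptions, not verified
   // against the code path that feeds this timeout.
   throw new QueryTimeoutException(
       "Query did not complete within the configured timeout period. "
       + "Increase the 'timeout' query context parameter or "
       + "'druid.server.http.defaultQueryTimeout', or tune query performance."
   );
   ```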



##########
core/src/main/java/org/apache/druid/java/util/common/guava/ParallelMergeCombiningSequence.java:
##########
@@ -206,7 +207,8 @@ public boolean hasNext()
                     }
                   }
                   if (currentBatch == null) {
-                    throw new QueryTimeoutException("Sequence iterator timed out waiting for data");
+                    throw new QueryTimeoutException("Query did not complete within configured timeout period. " +

Review Comment:
   Suggestion: if the error appears more than once, declare a constant and reference the constant. That also makes it easier for tests to track the message as it evolves.
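   
   A minimal sketch of the idea (the class is pared down for illustration, and the import path is my guess at where `QueryTimeoutException` lives):
   
   ```java
   import org.apache.druid.query.QueryTimeoutException;
   
   public class ParallelMergeCombiningSequence
   {
     // One shared constant referenced at every throw site, so tests can
     // assert against the constant instead of duplicating the literal.
     static final String TIMED_OUT_MESSAGE =
         "Query did not complete within the configured timeout period.";
   
     private static void checkTimeout(boolean hasTimeout, long remainingNanos)
     {
       if (hasTimeout && remainingNanos < 0) {
         throw new QueryTimeoutException(TIMED_OUT_MESSAGE);
       }
     }
   }
   ```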



##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
 |`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
 |`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
 |`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |

Review Comment:
   "This metric gets reset" --> "This metric is reset"



##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                                 .execute();
 
-      retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+      retVal = numRows == 1
+          ? DataStoreMetadataUpdateResult.SUCCESS
+          : new DataStoreMetadataUpdateResult(
+              true,
+          true,
+          "Failed to insert metadata for datasource [%s]",

Review Comment:
   This error is not very actionable for newbies: it is helpful only to developers already familiar with the area. If we can't insert metadata, then we've probably got a fatal or data-loss error, don't we?
   
   Perhaps:
   
   > Fatal error: Supervisor %s failed to update metadata for datasource %s in table %s. Newly ingested segments are lost.
   
   Or some such.
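   
   A sketch of what that might look like at the call site, following the `StringUtils.format` pattern used elsewhere in this diff. `supervisorId`, `dataSource`, and `tableName` are hypothetical locals, and whether the result should be retryable is for the authors to decide:
   
   ```java
   // Sketch only: a fatal, non-retryable result whose message names the
   // actors involved and states the consequence for the user.
   return new DataStoreMetadataUpdateResult(
       true,   // failed
       false,  // not retryable
       StringUtils.format(
           "Fatal error: supervisor %s failed to update metadata for datasource %s"
           + " in table %s. Newly ingested segments are lost.",
           supervisorId,
           dataSource,
           tableName
       )
   );
   ```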



##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1908,4 +1916,71 @@ public String getSequenceName()
     }
   }
 
+  public static class DataStoreMetadataUpdateResult
+  {
+    private final boolean failed;
+    private final boolean canRetry;
+    @Nullable private final String errorMsg;

Review Comment:
   This class is a good idea, but we're jumping to error messages too soon. This class should hold the exception, not the error message. The exception class conveys information, and the exception has a stack trace. With only a message, developers have no information about which specific bit of code went wrong when debugging. Convert the exception to a string at the point the string is needed.
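   
   A minimal sketch of that shape (field and accessor names are illustrative, not the PR's actual API; the PR nests this class, but it is shown standalone here):
   
   ```java
   import javax.annotation.Nullable;
   
   public class DataStoreMetadataUpdateResult
   {
     private final boolean failed;
     private final boolean canRetry;
     // Keep the Throwable: it carries the exception class, cause chain,
     // and stack trace, all of which a pre-rendered message throws away.
     @Nullable
     private final Throwable error;
   
     public DataStoreMetadataUpdateResult(boolean failed, boolean canRetry, @Nullable Throwable error)
     {
       this.failed = failed;
       this.canRetry = canRetry;
       this.error = error;
     }
   
     @Nullable
     public String getErrorMsg()
     {
       // Render to a string only at the point the string is needed.
       return error == null ? null : error.toString();
     }
   }
   ```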



##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1679,11 +1685,13 @@ protected DataStoreMetadataUpdateResult dropSegmentsWithHandle(
         );
 
     if (numChangedSegments != segmentsToDrop.size()) {
-      log.warn("Failed to drop segments metadata update as numChangedSegments[%s] segmentsToDropSize[%s]",
-               numChangedSegments,
-               segmentsToDrop.size()
+      return new DataStoreMetadataUpdateResult(
+          true,
+          true,
+          "Failed to drop some segments. Only [%s] could be dropped out of [%s]. Trying again",

Review Comment:
   Here, the value is a number. We don't need brackets or `%s`.
   
   > Only %d out of %d were dropped.



##########
docs/operations/metrics.md:
##########
@@ -221,7 +221,7 @@ batch ingestion emit the following metrics. These metrics are deltas for each em
 |`ingest/merge/cpu`|Cpu time in Nanoseconds spent on merging intermediate segments.|`dataSource`, `taskId`, `taskType`, `tags`|Depends on configuration. Generally a few minutes at most.|
 |`ingest/handoff/count`|Number of handoffs that happened.|`dataSource`, `taskId`, `taskType`, `tags`|Varies. Generally greater than 0 once every segment granular period if cluster operating normally.|
 |`ingest/sink/count`|Number of sinks not handoffed.|`dataSource`, `taskId`, `taskType`, `tags`|1~3|
-|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. |`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |
+|`ingest/events/messageGap`|Time gap in milliseconds between the latest ingested event timestamp and the current system timestamp of metrics emission. If this is increasing but lag is low, it likely means that druid is not getting new data. This metric will get reset as new tasks spawn up.|`dataSource`, `taskId`, `taskType`, `tags`|Greater than 0, depends on the time carried in event. |

Review Comment:
   What does this mean: "the current system timestamp of metrics emission"? This part is clear: "the current system timestamp". It is not clear how the "system" (meaning the OS) knows about metrics emission. I suspect we just mean `now - __time`. Correct? If so, perhaps reword.
   
   > Druid computes the lag between the current system timestamp and the `__time` value for each record. Druid emits that value, for each record, as this metric. If the value...
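   
   If that reading is right, the semantics boil down to something like this (a hypothetical sketch, not Druid's actual code):
   
   ```java
   // Hypothetical: the gap between "now" and the latest ingested event's
   // __time value, in milliseconds.
   static long messageGapMillis(long latestIngestedEventTimeMillis)
   {
     return System.currentTimeMillis() - latestIngestedEventTimeMillis;
   }
   ```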



##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1572,12 +1566,12 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
 
     if (!startMetadataMatchesExisting) {
       // Not in the desired start state.
-      log.error(
-          "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s].",
+      return new DataStoreMetadataUpdateResult(true, false, StringUtils.format(
+          "Not updating metadata, existing state[%s] in metadata store doesn't match to the new start state[%s]." +
+              " This usually happens if the input stream is changed within the same supervisor.",

Review Comment:
   As a newbie to this area, the message is a bit of a mystery. Are we saying that we're using optimistic locking, and we just tried to do an update where we expected version X but found a newer version Y? Or that the DB transaction failed (perhaps due to locking conflicts) and we discarded the update? Something else?
   
   Also, how does an "input stream change" within "the same supervisor"? Are we saying that the _contents_ of the data in Kafka (say) changed? Or, that the offsets registered in Kafka changed? Or, did I run two supervisors against the same topic?
   
   We also don't say which state we're changing. The state of the supervisor task? Which one?
   
   Is this error a bad thing? Will I lose data? Or will the system continue on, so I can ignore the messages? Or, in fact, is this an assertion that, in a properly coded Druid, will never occur?
   
   So, perhaps:
   
   > Attempted to update state for supervisor %s in table %s from %s to %s, but found the state was already at %s. Are you running two supervisors for the same topic?
   
   Don't take the above too literally: I'm making up much of the information.



##########
server/src/main/java/org/apache/druid/metadata/IndexerSQLMetadataStorageCoordinator.java:
##########
@@ -1605,7 +1599,13 @@ protected DataStoreMetadataUpdateResult updateDataSourceMetadataWithHandle(
                                 .bind("commit_metadata_sha1", newCommitMetadataSha1)
                                 .execute();
 
-      retVal = numRows == 1 ? DataStoreMetadataUpdateResult.SUCCESS : DataStoreMetadataUpdateResult.TRY_AGAIN;
+      retVal = numRows == 1
+          ? DataStoreMetadataUpdateResult.SUCCESS
+          : new DataStoreMetadataUpdateResult(
+              true,
+          true,
+          "Failed to insert metadata for datasource [%s]",

Review Comment:
   Nit: the `word[%s]` pattern is a bit awkward. It may have started from thread names: `datasource[%s]`. But, for error messages, a better approach is:
   
   * If the `%s` value is a single symbol (no spaces), omit the brackets.
   * If the `%s` value contains spaces, include brackets, but insert a space before the bracket.
   
   Example: `datasource %s not found`, `The name [%s] is not valid`
   
   In the first case, the rules for datasource names disallow spaces. In the second, the name is invalid, so it may well be "foo bar" (with spaces), and we need the brackets.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
