You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/05/22 19:00:47 UTC

[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1200918069


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are poartially or fully computed by this cache, this merge policy can result in query time errors if incompatible
+   * types are mixed if the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy

Review Comment:
   hmm, I tried to capture my hesitation to name the actual class something like that in the javadocs because the only reason it is 'newestFirst' is because of the external iterator ordering that is passed to it, and if the iterator changes then this one would take on the behavior of whatever order the iterator has. So i wanted to make it super clear that this doesn't actually do anything to ensure it is newest first, and instead just uses the first non-null type it finds. This way if someone refactors this cache it could hopefully be easier to notice that something might need to be done to make the behavior of this mode not change.
   
   However, the type name is user facing, so it seemed important that to make it a name that people could understand, so i did choose the name that captures the behavior of policy + iterator for that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org