Posted to commits@druid.apache.org by "clintropolis (via GitHub)" <gi...@apache.org> on 2023/05/19 22:26:30 UTC

[GitHub] [druid] clintropolis opened a new pull request, #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

clintropolis opened a new pull request, #14319:
URL: https://github.com/apache/druid/pull/14319

   ### Description
   This PR adds a new interface to control how `SegmentMetadataCache` chooses a `ColumnType` when it differs between segments for the computed SQL schemas, exposed as `druid.sql.planner.metadataColumnTypeMergePolicy`, and adds a new 'least restrictive type' mode that allows choosing the type that data across all segments can best be coerced into. The existing "newest first" behavior remains the default, primarily because this is a behavior change around when schema migrations take effect for the SQL schema. With `{"type":"newestFirst"}`, the SQL schema is updated as soon as the first job with the new schema has published segments, while with `{"type":"leastRestrictive"}`, the schema is only updated once all segments are reindexed to the new type. The benefit of `leastRestrictive` is that it eliminates a class of type coercion errors that can happen in SQL under `newestFirst` when types vary across segments and the newest type cannot correctly represent older data, such as when segments contain a mix of ARRAY and number types, or other combinations that lead to odd query plans.
   
   I am not at all attached to these names, so if something else would be more intuitive, feel free to suggest it.
   
   #### Release note
   A new broker configuration, `druid.sql.planner.metadataColumnTypeMergePolicy`, adds configurable modes for how column types are computed for the SQL table schema when they differ between segments. A new 'least restrictive type' mode allows choosing the most appropriate type that data across all segments can best be coerced into. The existing "newest first" behavior remains the default, primarily because this is a behavior change around when schema migrations take effect for the SQL schema. With `{"type":"newestFirst"}`, the SQL schema is updated as soon as the first job with the new schema has published segments, while with `{"type":"leastRestrictive"}`, the schema is only updated once all segments are reindexed to the new type. However, `{"type":"leastRestrictive"}` is likely to have better query-time behavior and eliminates some query-time errors that can occur when using `newestFirst`.
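   
   For reference, this would be set on the Broker like any other SQL planner property; a minimal example in `runtime.properties` (value format per the docs change in this PR):
   
   ```properties
   druid.sql.planner.metadataColumnTypeMergePolicy={"type":"leastRestrictive"}
   ```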
   
   <hr>
   
   <!-- Check the items by putting "x" in the brackets for the done things. Not all of these items apply to every PR. Remove the items which are not done or not relevant to the PR. None of the items from the checklist below are strictly necessary, but it would be very helpful if you at least self-review the PR. -->
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   




[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202957170


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,118 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = LeastRestrictiveTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "latestInterval";
+    private static final FirstTypeMergePolicy INSTANCE = new FirstTypeMergePolicy();
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, then all are json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.
+   */
+  public static class LeastRestrictiveTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "leastRestrictive";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      try {
+        return ColumnType.leastRestrictiveType(existingType, newType);
+      }
+      catch (Types.IncompatibleTypeException incompatibleTypeException) {
+        // fall back to first encountered type if they are not compatible for some reason
+        return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);

Review Comment:
   what would happen if we just allowed the exception to propagate through here instead of reverting back to the newest first strategy? With this reversion, it feels like this policy may be hard to define, maybe? In this case, how is the user expected to query the data if the types are not expressible from one least restrictive base type?





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1200924503


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "newestFirst";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, then all are json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.

Review Comment:
   we iterate all segments for both modes; this PR didn't change that part. `MAX_SEGMENTS_PER_QUERY` defines the batch size of how many segments we issue segment metadata queries for at a time. Naively, I think there shouldn't be any real performance difference between the policies, since we always iterate all segments and columns regardless of the policy; the policy just chooses how we accumulate them into the final `RowSignature`.
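   
   For readers following along, here is a rough sketch of that accumulation (illustrative only, not the actual cache internals; it assumes `ColumnAnalysis#getTypeSignature` for the per-segment type):
   
   ```java
   import org.apache.druid.query.metadata.metadata.ColumnAnalysis;
   import org.apache.druid.query.metadata.metadata.SegmentAnalysis;
   import org.apache.druid.segment.column.ColumnType;
   import org.apache.druid.segment.column.RowSignature;
   import org.apache.druid.sql.calcite.schema.SegmentMetadataCache;
   
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   
   class SignatureSketch
   {
     // Merge per-column types from segment metadata results into a single RowSignature,
     // letting the configured policy decide how an existing type and a new type combine.
     static RowSignature buildSignature(
         List<SegmentAnalysis> segmentMetadataResults,
         SegmentMetadataCache.ColumnTypeMergePolicy mergePolicy
     )
     {
       final Map<String, ColumnType> columnTypes = new LinkedHashMap<>();
       for (SegmentAnalysis analysis : segmentMetadataResults) {
         for (Map.Entry<String, ColumnAnalysis> entry : analysis.getColumns().entrySet()) {
           final ColumnType newType = entry.getValue().getTypeSignature();
           // the policy sees the accumulated type (possibly null) and the newly observed type
           columnTypes.put(entry.getKey(), mergePolicy.merge(columnTypes.get(entry.getKey()), newType));
         }
       }
       final RowSignature.Builder builder = RowSignature.builder();
       columnTypes.forEach(builder::add);
       return builder.build();
     }
   }
   ```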





[GitHub] [druid] abhishekagarwal87 commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203487777


##########
processing/src/main/java/org/apache/druid/segment/column/Types.java:
##########
@@ -112,4 +113,12 @@ public static <T extends TypeDescriptor> boolean either(
     return (typeSignature1 != null && typeSignature1.is(typeDescriptor)) ||
            (typeSignature2 != null && typeSignature2.is(typeDescriptor));
   }
+
+  public static class IncompatibleTypeException extends IAE

Review Comment:
   Q - would this exception return a 5xx to the user? I am wondering if we should change the exception so that the end user gets a 400 instead, since it's really a bad query.



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,126 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+
+    @JsonCreator
+    static ColumnTypeMergePolicy fromString(String type)
+    {
+      if (LeastRestrictiveTypeMergePolicy.NAME.equals(type)) {

Review Comment:
   can we use a case-insensitive comparison here?
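   
   A sketch of what that might look like, assuming each policy exposes an `INSTANCE` singleton (illustrative; the final code may differ):
   
   ```java
   @JsonCreator
   static ColumnTypeMergePolicy fromString(String type)
   {
     if (LeastRestrictiveTypeMergePolicy.NAME.equalsIgnoreCase(type)) {
       return LeastRestrictiveTypeMergePolicy.INSTANCE;
     }
     if (FirstTypeMergePolicy.NAME.equalsIgnoreCase(type)) {
       return FirstTypeMergePolicy.INSTANCE;
     }
     // IAE is Druid's IllegalArgumentException subclass with format-string support
     throw new IAE("Unknown type [%s], expected [%s] or [%s]", type, FirstTypeMergePolicy.NAME, LeastRestrictiveTypeMergePolicy.NAME);
   }
   ```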





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203512991


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,126 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+
+    @JsonCreator
+    static ColumnTypeMergePolicy fromString(String type)
+    {
+      if (LeastRestrictiveTypeMergePolicy.NAME.equals(type)) {

Review Comment:
   done





[GitHub] [druid] clintropolis commented on pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #14319:
URL: https://github.com/apache/druid/pull/14319#issuecomment-1557995783

   >For leastRestrictive, perhaps bestFit is clearer? My concern is that "least restrictive", while correct, can also refer to a policy that just chooses the widest possible type.
   
   'leastRestrictive' is effectively a rather permissive widening of types, see https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/column/ColumnType.java#L139 for the logic backing it, so I'm not sure it's really wrong to interpret it that way.
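   
   To make that concrete, a few illustrative outcomes of that linked logic (a sketch, not an exhaustive spec):
   
   ```java
   ColumnType.leastRestrictiveType(ColumnType.LONG, ColumnType.LONG);       // LONG
   ColumnType.leastRestrictiveType(ColumnType.LONG, ColumnType.DOUBLE);     // DOUBLE (mixed numbers widen)
   ColumnType.leastRestrictiveType(ColumnType.LONG, ColumnType.STRING);     // STRING
   ColumnType.leastRestrictiveType(ColumnType.LONG, ColumnType.LONG_ARRAY); // ARRAY<LONG>
   // mixing distinct complex types throws Types.IncompatibleTypeException
   ```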




[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202961243


##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnMerger.java:
##########
@@ -93,6 +93,9 @@ public void writeMergedValueDictionary(List<IndexableAdapter> adapters) throws I
         final IndexableAdapter.NestedColumnMergable mergable = closer.register(
             adapter.getNestedColumnMergeables(name)
         );
+        if (mergable == null) {

Review Comment:
   no, it's fine, I was just curious.







[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203510841


##########
processing/src/main/java/org/apache/druid/segment/column/Types.java:
##########
@@ -112,4 +113,12 @@ public static <T extends TypeDescriptor> boolean either(
     return (typeSignature1 != null && typeSignature1.is(typeDescriptor)) ||
            (typeSignature2 != null && typeSignature2.is(typeDescriptor));
   }
+
+  public static class IncompatibleTypeException extends IAE

Review Comment:
   hmm, in this particular PR the exception isn't really user facing, because it happens inside the Broker's segment metadata cache that powers the SQL schema, in the background. This logic does not run during user queries.
   
   The callers of `ExpressionTypeConversion` that now also throw this exception instead of `IAE` are, I guess, potentially user facing (though it could also happen in ingest-time transforms). Still, I'm not really sure this is an exception that should be thrown directly to users; it should probably be caught by something and decorated with additional context, such as what the offending expression was.





[GitHub] [druid] dkoepke commented on pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "dkoepke (via GitHub)" <gi...@apache.org>.
dkoepke commented on PR #14319:
URL: https://github.com/apache/druid/pull/14319#issuecomment-1557827869

   Regarding naming:
   
   For `newestFirst`, "newest" without context might be confusing because it can refer either to when the segment was created/published OR the interval it covers. I believe the behavior is to select the segment with the latest interval (since the segments are sorted by segment ID, and the interval comes first in the ID), so a policy name like `latestInterval` might be clearer?
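   
   (For context, segment IDs lead with the datasource and interval, so ordering by ID effectively orders by interval; hypothetical IDs for illustration:)
   
   ```
   wikipedia_2023-05-01T00:00:00.000Z_2023-05-02T00:00:00.000Z_2023-05-02T01:00:00.000Z
   wikipedia_2023-05-02T00:00:00.000Z_2023-05-03T00:00:00.000Z_2023-05-03T01:00:00.000Z
   ```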
   
   For `leastRestrictive`, perhaps `bestFit` is clearer? My concern is that "least restrictive", while correct, can also refer to a policy that just chooses the widest possible type.




[GitHub] [druid] clintropolis commented on pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on PR #14319:
URL: https://github.com/apache/druid/pull/14319#issuecomment-1558183392

   >For newestFirst, "newest" without context might be confusing because it can refer either to when the segment was created/published OR the interval it covers. I believe the behavior is to select the segment with the latest interval (since the segments are sorted by segment ID, and the interval comes first in the ID), so a policy name like latestInterval might be clearer?
   
   `latestInterval` probably is clearer; I can make the change soon unless anyone has better ideas or different opinions




[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202960376


##########
docs/configuration/index.md:
##########
@@ -1954,6 +1954,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.http.enable`|Whether to enable JSON over HTTP querying at `/druid/v2/sql/`.|true|
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
+|`druid.sql.planner.metadataColumnTypeMergePolicy`|Defines how column types are chosen for the SQL schema when they differ between segments. Options are specified as a JSON object, with valid choices of `{"type":"leastRestrictive"}` or `{"type":"latestInterval"}`. For `leastRestrictive`, Druid automatically widens the type computed for the schema to a type that data across all segments can be converted into; however, planned schema migrations can only take effect once all segments have been re-ingested to the new type. With `latestInterval`, the column type in the most recent time chunks defines the type for the schema.|`{"type":"leastRestrictive"}`|

Review Comment:
   Any reason not to just configure this with simpler values like `leastRestrictive` or `latestInterval`?
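   
   (If plain string values were supported, as the `fromString` creator in the later revision suggests, the property could presumably be set like:)
   
   ```properties
   druid.sql.planner.metadataColumnTypeMergePolicy=leastRestrictive
   ```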





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1200918069


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy

Review Comment:
   hmm, I tried to capture in the javadocs my hesitation to name the actual class something like that: the only reason it is 'newestFirst' is the external iterator ordering that is passed to it, and if the iterator changes, this policy would take on the behavior of whatever order the iterator has. So I wanted to make it super clear that this class doesn't actually do anything to ensure it is newest first; it just uses the first non-null type it finds. That way, if someone refactors this cache, it is hopefully easier to notice that something might need to be done to keep the behavior of this mode from changing.
   
   However, the type name is user facing, so it seemed important to make it a name that people could understand, so I chose the name that captures the behavior of policy + iterator.







[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202932517


##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnMerger.java:
##########
@@ -93,6 +93,9 @@ public void writeMergedValueDictionary(List<IndexableAdapter> adapters) throws I
         final IndexableAdapter.NestedColumnMergable mergable = closer.register(
             adapter.getNestedColumnMergeables(name)
         );
+        if (mergable == null) {

Review Comment:
   Was this an existing bug, or does this change somehow allow `mergable` to be null now?





[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203133006


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,118 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = LeastRestrictiveTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "latestInterval";
+    private static final FirstTypeMergePolicy INSTANCE = new FirstTypeMergePolicy();
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, then all are json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.
+   */
+  public static class LeastRestrictiveTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "leastRestrictive";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      try {
+        return ColumnType.leastRestrictiveType(existingType, newType);
+      }
+      catch (Types.IncompatibleTypeException incompatibleTypeException) {
+        // fall back to first encountered type if they are not compatible for some reason
+        return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);

Review Comment:
   Seems reasonable to me.





[GitHub] [druid] abhishekagarwal87 merged pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 merged PR #14319:
URL: https://github.com/apache/druid/pull/14319




[GitHub] [druid] abhishekrb19 commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1200950594


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy

Review Comment:
   That makes sense. Thanks for clarifying.





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1201111787


##########
processing/src/main/java/org/apache/druid/segment/column/ColumnType.java:
##########
@@ -218,13 +220,21 @@ public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullab
     }
 
     // all numbers win over longs
-    // floats vs doubles would be handled here, but we currently only support doubles...
     if (Types.is(type, ValueType.LONG) && Types.isNullOr(other, ValueType.LONG)) {
       return ColumnType.LONG;
     }
+    // doubles win over floats
     if (Types.is(type, ValueType.FLOAT) && Types.isNullOr(other, ValueType.FLOAT)) {
       return ColumnType.FLOAT;
     }
     return ColumnType.DOUBLE;
   }
+
+  public static class IncompatibleTypeException extends IAE
+  {
+    public IncompatibleTypeException(ColumnType type, ColumnType other)
+    {
+      super("Cannot implicitly cast %s to %s", type, other);

Review Comment:
   I considered this, but I was also planning to use this for some expression type casting errors too, which don't always have a column they belong to; rather, it might be the output type of an entire expression tree, so I left the column out of the exception for now (and am adding it in the place where expression types would use it).
   
   The cache isn't actually logging about this error; instead it eats it and basically falls back to the old logic (I'm updating it to actually just use the other merge policy in the catch). I guess it could warn, but I was afraid that would be too chatty in the event you actually did have segments of incompatible types, so debug would probably be the most appropriate level.
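   
   For illustration, the debug-level fallback could look something like this (a sketch; it assumes the policy class has a Druid `Logger` named `LOG`):
   
   ```java
   @Override
   public ColumnType merge(ColumnType existingType, ColumnType newType)
   {
     try {
       return ColumnType.leastRestrictiveType(existingType, newType);
     }
     catch (Types.IncompatibleTypeException incompatibleTypeException) {
       // debug rather than warn: this can fire for every such column on every metadata
       // refresh when segments genuinely mix incompatible types, so warn would be chatty
       LOG.debug("Cannot merge [%s] and [%s], falling back to first encountered type", existingType, newType);
       return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);
     }
   }
   ```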





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203099360


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,118 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = LeastRestrictiveTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic: use the first type we encounter. This policy is effectively 'newest first' because we iterate
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} is used to compute the SQL schema, at least in systems whose SQL schemas are
+   * partially or fully computed by this cache, this merge policy can result in query-time errors if incompatible
+   * types are mixed and the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "latestInterval";
+    private static final FirstTypeMergePolicy INSTANCE = new FirstTypeMergePolicy();
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, then all are json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.
+   */
+  public static class LeastRestrictiveTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "leastRestrictive";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      try {
+        return ColumnType.leastRestrictiveType(existingType, newType);
+      }
+      catch (Types.IncompatibleTypeException incompatibleTypeException) {
+        // fall back to first encountered type if they are not compatible for some reason
+        return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);

Review Comment:
   I wanted to catch the exception here instead of exploding, because exploding interferes with computing the SQL schema, so it seemed better to fall back to the type from the most recent time chunk than to blow up.
   
   The specific scenario I was imagining is someone migrating a column between different sketch types. That is totally incompatible to do in the same column, but instead of blowing up the schema it seemed better to fall back to the existing behavior so the problem can at least be worked around (instead of breaking even more stuff).
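   
   To make the fallback concrete, here is a minimal standalone sketch (illustrative only, not code from this PR; the complex type names are made up) of `leastRestrictiveType` degrading to the newest-first choice when the types cannot be widened:
   
   ```java
   import org.apache.druid.segment.column.ColumnType;
   import org.apache.druid.segment.column.Types;
   
   class MergeFallbackSketch
   {
     static ColumnType mergeOrFallBack(ColumnType newest, ColumnType older)
     {
       try {
         // leastRestrictive: widen to a type that data from both segments can be coerced into
         return ColumnType.leastRestrictiveType(newest, older);
       }
       catch (Types.IncompatibleTypeException e) {
         // e.g. two different sketch complex types: keep the type from the most
         // recent time chunk instead of failing the whole schema computation
         return newest;
       }
     }
   
     public static void main(String[] args)
     {
       // hypothetical complex type names, purely for illustration
       ColumnType merged = mergeOrFallBack(
           ColumnType.ofComplex("thetaSketch"),
           ColumnType.ofComplex("HLLSketch")
       );
       System.out.println(merged); // prints the newest type, COMPLEX<thetaSketch>
     }
   }
   ```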





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1203099633


##########
docs/configuration/index.md:
##########
@@ -1954,6 +1954,7 @@ The Druid SQL server is configured through the following properties on the Broke
 |`druid.sql.http.enable`|Whether to enable JSON over HTTP querying at `/druid/v2/sql/`.|true|
 |`druid.sql.planner.maxTopNLimit`|Maximum threshold for a [TopN query](../querying/topnquery.md). Higher limits will be planned as [GroupBy queries](../querying/groupbyquery.md) instead.|100000|
 |`druid.sql.planner.metadataRefreshPeriod`|Throttle for metadata refreshes.|PT1M|
+|`druid.sql.planner.metadataColumnTypeMergePolicy`|Defines how column types will be chosen when faced with differences between segments when computing the SQL schema. Options are specified as a JSON object, with valid choices of `{"type":"leastRestrictive"}` or `{"type":"latestInterval"}`. For `leastRestrictive`, Druid will automatically widen the type computed for the schema to a type which data across all segments can be converted into, however planned schema migrations can only take effect once all segments have been re-ingested to the new schema. With `latestInterval`, the column type in most recent time chunks defines the type for the schema. |`{"type":"leastRestrictive"}`|

Review Comment:
   Honest answer? I forget how to do that off the top of my head 😅
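   
   For anyone reading along, an illustrative `runtime.properties` snippet for the Broker (property name and value taken from the docs row above; the exact line is a sketch, not part of the PR):
   
   ```properties
   # Broker runtime.properties: opt in to least-restrictive type merging
   druid.sql.planner.metadataColumnTypeMergePolicy={"type":"leastRestrictive"}
   ```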





[GitHub] [druid] zachjsh commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202957170


##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,118 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = LeastRestrictiveTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are partially or fully computed by this cache, this merge policy can result in query time errors if incompatible
+   * types are mixed if the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "latestInterval";
+    private static final FirstTypeMergePolicy INSTANCE = new FirstTypeMergePolicy();
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, are all json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.
+   */
+  public static class LeastRestrictiveTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "leastRestrictive";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      try {
+        return ColumnType.leastRestrictiveType(existingType, newType);
+      }
+      catch (Types.IncompatibleTypeException incompatibleTypeException) {
+        // fall back to first encountered type if they are not compatible for some reason
+        return FirstTypeMergePolicy.INSTANCE.merge(existingType, newType);

Review Comment:
   What would happen if we just allowed the exception to propagate through here instead of reverting to the newest-first strategy? With this fallback, it feels like the policy's behavior may be hard to define precisely.





[GitHub] [druid] clintropolis commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "clintropolis (via GitHub)" <gi...@apache.org>.
clintropolis commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1202938396


##########
processing/src/main/java/org/apache/druid/segment/NestedDataColumnMerger.java:
##########
@@ -93,6 +93,9 @@ public void writeMergedValueDictionary(List<IndexableAdapter> adapters) throws I
         final IndexableAdapter.NestedColumnMergable mergable = closer.register(
             adapter.getNestedColumnMergeables(name)
         );
+        if (mergable == null) {

Review Comment:
   Oh, this is an unrelated bug fix; not sure why I put it in this PR, though. I can remove it.





[GitHub] [druid] abhishekrb19 commented on a diff in pull request #14319: add configurable ColumnTypeMergePolicy to SegmentMetadataCache

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on code in PR #14319:
URL: https://github.com/apache/druid/pull/14319#discussion_r1199575200


##########
processing/src/main/java/org/apache/druid/segment/column/ColumnType.java:
##########
@@ -152,7 +152,7 @@ public static ColumnType ofComplex(@Nullable String complexTypeName)
    *                                                                   inference
    */
   @Nullable
-  public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullable ColumnType other)
+  public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullable ColumnType other) throws IncompatibleTypeException

Review Comment:
   Good call on the new exception type. May also want to update the javadoc to reflect `IllegalArgumentException` -> `IncompatibleTypeException`



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are partially or fully computed by this cache, this merge policy can result in query time errors if incompatible
+   * types are mixed if the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "newestFirst";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, are all json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.

Review Comment:
   Is the type actually resolved from all the segments, or is it limited to the most recent `MAX_SEGMENTS_PER_QUERY` segments per query (hardcoded to 15000)?



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -808,20 +812,7 @@ DatasourceTable.PhysicalDatasourceMetadata buildDruidTable(final String dataSour
                 rowSignature.getColumnType(column)
                             .orElseThrow(() -> new ISE("Encountered null type for column [%s]", column));
 
-            columnTypes.compute(column, (c, existingType) -> {

Review Comment:
   With the new `leastRestrictive` merge policy, the above comment `// Newer column types should override older ones.` may no longer hold, right?
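   
   A quick sketch of why (assuming `leastRestrictiveType` widens numbers to strings, per its fall-through logic; the types here are chosen purely for illustration):
   
   ```java
   import org.apache.druid.segment.column.ColumnType;
   
   class NewerDoesNotOverride
   {
     public static void main(String[] args)
     {
       // older segments typed the column STRING, a newer segment typed it LONG
       ColumnType merged = ColumnType.leastRestrictiveType(ColumnType.LONG, ColumnType.STRING);
       // newestFirst would report LONG (newer overrides older);
       // leastRestrictive widens to STRING, which both can coerce into
       System.out.println(merged); // STRING
     }
   }
   ```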



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are partially or fully computed by this cache, this merge policy can result in query time errors if incompatible
+   * types are mixed if the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy

Review Comment:
   nit: should we align this class name with the policy name, e.g. `NewestFirstTypeMergePolicy` or `LatestTypeMergePolicy`?



##########
processing/src/main/java/org/apache/druid/segment/column/ColumnType.java:
##########
@@ -218,13 +220,21 @@ public static ColumnType leastRestrictiveType(@Nullable ColumnType type, @Nullab
     }
 
     // all numbers win over longs
-    // floats vs doubles would be handled here, but we currently only support doubles...
     if (Types.is(type, ValueType.LONG) && Types.isNullOr(other, ValueType.LONG)) {
       return ColumnType.LONG;
     }
+    // doubles win over floats
     if (Types.is(type, ValueType.FLOAT) && Types.isNullOr(other, ValueType.FLOAT)) {
       return ColumnType.FLOAT;
     }
     return ColumnType.DOUBLE;
   }
+
+  public static class IncompatibleTypeException extends IAE
+  {
+    public IncompatibleTypeException(ColumnType type, ColumnType other)
+    {
+      super("Cannot implicitly cast %s to %s", type, other);

Review Comment:
   ```suggestion
         super("Cannot implicitly cast [%s] to [%s]", type, other);
   ```
   
   nit: I think it would also be helpful to pass in the column name so it's clear which column has incompatible types.
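   
   A hypothetical variant with the column name threaded through (the constructor below is illustrative, not from the PR; `IAE` is the same formatted-message base class used in the original):
   
   ```java
   public static class IncompatibleTypeException extends IAE
   {
     public IncompatibleTypeException(String columnName, ColumnType type, ColumnType other)
     {
       // IAE accepts a printf-style format string plus arguments
       super("Cannot implicitly cast [%s] to [%s] for column [%s]", type, other, columnName);
     }
   }
   ```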



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are poartially or fully computed by this cache, this merge policy can result in query time errors if incompatible

Review Comment:
   ```suggestion
      * are partially or fully computed by this cache, this merge policy can result in query time errors if incompatible
   ```



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types

Review Comment:
   Nice commentary, the code is easy to follow 👍



##########
sql/src/main/java/org/apache/druid/sql/calcite/planner/SegmentMetadataCacheConfig.java:
##########
@@ -113,6 +94,7 @@ public String toString()
            ", metadataSegmentCacheEnable=" + metadataSegmentCacheEnable +
            ", metadataSegmentPollPeriod=" + metadataSegmentPollPeriod +
            ", awaitInitializationOnStart=" + awaitInitializationOnStart +
+           ", columnTypeMergePolicy=" + metadataColumnTypeMergePolicy +

Review Comment:
   ```suggestion
              ", metadataColumnTypeMergePolicy=" + metadataColumnTypeMergePolicy +
   ```



##########
sql/src/main/java/org/apache/druid/sql/calcite/schema/SegmentMetadataCache.java:
##########
@@ -995,4 +986,117 @@ void doInLock(Runnable runnable)
       runnable.run();
     }
   }
+
+
+  /**
+   * ColumnTypeMergePolicy defines the rules of which type to use when faced with the possibility of different types
+   * for the same column from segment to segment. It is used to help compute a {@link RowSignature} for a table in
+   * Druid based on the segment metadata of all segments, merging the types of each column encountered to end up with
+   * a single type to represent it globally.
+   */
+  @JsonTypeInfo(use = JsonTypeInfo.Id.NAME, property = "type", defaultImpl = FirstTypeMergePolicy.class)
+  @JsonSubTypes(value = {
+      @JsonSubTypes.Type(name = FirstTypeMergePolicy.NAME, value = FirstTypeMergePolicy.class),
+      @JsonSubTypes.Type(name = LeastRestrictiveTypeMergePolicy.NAME, value = LeastRestrictiveTypeMergePolicy.class)
+  })
+  @FunctionalInterface
+  public interface ColumnTypeMergePolicy
+  {
+    ColumnType merge(ColumnType existingType, ColumnType newType);
+  }
+
+  /**
+   * Classic logic, we use the first type we encounter. This policy is effectively 'newest first' because we iterated
+   * segments starting from the most recent time chunk, so this typically results in the most recently used type being
+   * chosen, at least for systems that are continuously updated with 'current' data.
+   *
+   * Since {@link ColumnTypeMergePolicy} are used to compute the SQL schema, at least in systems using SQL schemas which
+   * are partially or fully computed by this cache, this merge policy can result in query time errors if incompatible
+   * types are mixed if the chosen type is more restrictive than the types of some segments. If data is likely to vary
+   * in type across segments, consider using {@link LeastRestrictiveTypeMergePolicy} instead.
+   */
+  public static class FirstTypeMergePolicy implements ColumnTypeMergePolicy
+  {
+    public static final String NAME = "newestFirst";
+
+    @Override
+    public ColumnType merge(ColumnType existingType, ColumnType newType)
+    {
+      if (existingType == null) {
+        return newType;
+      }
+      if (newType == null) {
+        return existingType;
+      }
+      // if any are json, are all json
+      if (ColumnType.NESTED_DATA.equals(newType) || ColumnType.NESTED_DATA.equals(existingType)) {
+        return ColumnType.NESTED_DATA;
+      }
+      // "existing type" is the 'newest' type, since we iterate the segments list by newest start time
+      return existingType;
+    }
+
+    @Override
+    public int hashCode()
+    {
+      return Objects.hash(NAME);
+    }
+
+    @Override
+    public boolean equals(Object o)
+    {
+      if (this == o) {
+        return true;
+      }
+      return o != null && getClass() == o.getClass();
+    }
+
+    @Override
+    public String toString()
+    {
+      return NAME;
+    }
+  }
+
+  /**
+   * Resolves types using {@link ColumnType#leastRestrictiveType(ColumnType, ColumnType)} to find the ColumnType that
+   * can best represent all data contained across all segments.

Review Comment:
   Also, I wonder what the performance implications of choosing the `leastRestrictive` strategy are, given that this policy has to scan many or all segments per datasource. Should we call out any gotchas explicitly in the documentation?


