You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/04/21 07:45:05 UTC

[GitHub] [druid] clintropolis opened a new pull request #9731: ColumnCapabilities.hasMultipleValues refactor

clintropolis opened a new pull request #9731:
URL: https://github.com/apache/druid/pull/9731


   ### Description
   This PR changes `ColumnCapabilites.hasMultipleValues()` to be a `Capable` enum added in #9662, and removes `ColumnCapabilities.isComplete()` which was added in #7588, along with some slight refactoring of `ColumnCapabilitiesImpl`, to remove the `merge` method, which was only setting `isComplete`, and replacing it with a static method `ColumnCapabilities.complete` which is used at index merger time to finalize `Capable.UNKNOWN` values.
   
   Many uses in the query path of `ColumnCapabilites.hasMultipleValues` use a new `Capable` method, `isMaybeTrue` to check for either `TRUE` or `UNKNOWN`.
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `ColumnCapabilities`
    * `ColumnCapabilitiesImpl`
    *  a bunch of query and ingestion path stuffs that deals with strings having multiple values or not
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #9731:
URL: https://github.com/apache/druid/pull/9731


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435569647



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java
##########
@@ -127,6 +127,19 @@
    */
   EncodedKeyComponentType processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean reportParseExceptions);
 
+  /**
+   * This method will be called whenever a known dimension is absent in a row to allow an indexer

Review comment:
       I modified the javadocs, hopefully they are a bit clearer, and removed the default implementation




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435666847



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       Ah, I changed `ExpressionFilter` [to check `isMaybeTrue`](https://github.com/apache/druid/pull/9731/commits/7dc8754d4363e276b5502f355357a039a5b769b6#diff-6713b89829f1ed89eb70e4a39002b85dR118) to handle unknown, so i think unknown is fine here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435528807



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -60,7 +60,8 @@
                                   .setDictionaryEncoded(false)
                                   .setDictionaryValuesUnique(false)
                                   .setDictionaryValuesSorted(false)
-                                  .setHasBitmapIndexes(false);
+                                  .setHasBitmapIndexes(false)
+                                  .setHasMultipleValues(false);

Review comment:
       The issue with this is that it is used when column capabilities don't exist by `getEffectiveCapabilities`, and `makeVectorProcessor` was not in fact checking if it was `complete`, so the other change in this file to check `isMaybeTrue` causes a ton of test failures because it now it tries to make multi-value dimension processors/selectors for these columns. We could leave it as unknown, but change `makeVectorProcessor` to `isTrue`, but that doesn't seem correct either. Ideally, we remove these default capabilities entirely, but we aren't quite there yet I think.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435701929



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       Sorry, you're right, I meant `Capable.FALSE`. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r430688872



##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilities.java
##########
@@ -57,6 +50,21 @@ public boolean isTrue()
       return this == TRUE;
     }
 
+    public boolean isMaybeTrue()
+    {
+      return isTrue() || isUnknown();
+    }
+
+    public boolean isUnknown()
+    {
+      return this == UNKNOWN;
+    }
+
+    public Capable complete(boolean convertUnknownToTrue)

Review comment:
       I don't think it needs another enum, if anything it could just accept a `Capable` to use for the default value, though I should enforce that it cannot be unknown. If i rename to `coerceIfUnknown` or something maybe it would also be more obvious what the purpose is.

##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       Ok, given the race condition I guess I have to revert this behavior so queries use the fake capabilities and to fix the bug for segment metadata based schema discovery to plumb special the actual capabilities to segment metadata queries. Maybe longer term it would be better if we could snapshot at the time we query and make a cursor to only read the rows that were available at the time the capabilities were fetched so that we can have more performant queries on incremental indexes at the cost of missing a few rows.

##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       Ok, given the race condition I guess I have to revert this behavior so queries use the fake capabilities and to fix the bug for segment metadata based schema discovery to plumb special the actual capabilities to segment metadata queries. Maybe longer term it would be better if we could snapshot the capabilities at the time we query and make a cursor to only read the rows that were available at the time the capabilities were fetched and give those to the string indexer so that we can have more performant queries on incremental indexes at the cost of missing a few rows.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r420318956



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionIndexer.java
##########
@@ -127,6 +127,19 @@
    */
   EncodedKeyComponentType processRowValsToUnsortedEncodedKeyComponent(@Nullable Object dimValues, boolean reportParseExceptions);
 
+  /**
+   * This method will be called whenever a known dimension is absent in a row to allow an indexer

Review comment:
       1. This javadoc doesn't make a ton of sense to me. Specifically it's not clear to me what "known dimension is absent in a row" means or what "account for any missing rows" might refer to. I wonder if you could find an improved wording.
   2. IMO it'd be better to not have this method be `default`. It makes it too easy to forget to override it when necessary.

##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilities.java
##########
@@ -57,6 +50,21 @@ public boolean isTrue()
       return this == TRUE;
     }
 
+    public boolean isMaybeTrue()
+    {
+      return isTrue() || isUnknown();
+    }
+
+    public boolean isUnknown()
+    {
+      return this == UNKNOWN;
+    }
+
+    public Capable complete(boolean convertUnknownToTrue)

Review comment:
       The boolean here is tough for me to wrap my head around: when looking at method calls like `.complete(true)` and `.complete(false)`, it looks like it's describing whether or not things are complete. I think it would be better to use a two-valued enum, so the call would be `.makeComplete(UNKNOWN_TO_FALSE)` or `.makeComplete(UNKNOWN_TO_TRUE)`.

##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java
##########
@@ -31,23 +31,66 @@
  */
 public class ColumnCapabilitiesImpl implements ColumnCapabilities
 {
-  public static ColumnCapabilitiesImpl copyOf(final ColumnCapabilities other)
+  public static ColumnCapabilitiesImpl copyOf(@Nullable final ColumnCapabilities other)
   {
     final ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl();
-    capabilities.merge(other);
-    capabilities.setFilterable(other.isFilterable());
-    capabilities.setIsComplete(other.isComplete());
+    if (other != null) {
+      capabilities.type = other.getType();
+      capabilities.dictionaryEncoded = other.isDictionaryEncoded();
+      capabilities.runLengthEncoded = other.isRunLengthEncoded();
+      capabilities.hasInvertedIndexes = other.hasBitmapIndexes();
+      capabilities.hasSpatialIndexes = other.hasSpatialIndexes();
+      capabilities.hasMultipleValues = other.hasMultipleValues();
+      capabilities.dictionaryValuesSorted = other.areDictionaryValuesSorted();
+      capabilities.dictionaryValuesUnique = other.areDictionaryValuesUnique();
+      capabilities.filterable = other.isFilterable();
+    }
     return capabilities;
   }
 
+  /**
+   * Used at indexing time to finalize all {@link ColumnCapabilities.Capable#UNKNOWN} values to
+   * {@link ColumnCapabilities.Capable#FALSE}, in order to present a snapshot of the state of the this column
+   */
+  @Nullable
+  public static ColumnCapabilitiesImpl complete(
+      @Nullable final ColumnCapabilities capabilities,
+      boolean convertUnknownToTrue
+  )
+  {
+    if (capabilities == null) {
+      return null;
+    }
+    ColumnCapabilitiesImpl copy = copyOf(capabilities);
+    copy.hasMultipleValues = copy.hasMultipleValues.complete(convertUnknownToTrue);
+    copy.dictionaryValuesSorted = copy.dictionaryValuesSorted.complete(convertUnknownToTrue);
+    copy.dictionaryValuesUnique = copy.dictionaryValuesUnique.complete(convertUnknownToTrue);
+    return copy;
+  }
+
+
+  /**
+   * Create a no frills, simple column with {@link ValueType} set and everything else false
+   */
+  public static ColumnCapabilitiesImpl createSimpleNumericColumn(ValueType valueType)

Review comment:
       `createSimpleNumericCapabilities`?

##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnProcessors.java
##########
@@ -197,6 +197,6 @@
    */
   private static boolean mayBeMultiValue(@Nullable final ColumnCapabilities capabilities)
   {
-    return capabilities == null || !capabilities.isComplete() || capabilities.hasMultipleValues();
+    return capabilities == null || capabilities.hasMultipleValues().isMaybeTrue();

Review comment:
       Nice.

##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -160,11 +160,11 @@ public void close() throws IOException
   public boolean hasMultipleValues(final String dimension)

Review comment:
       IMO, better if this returns a `Capable`. Then the caller can decide what to do with unknowns.

##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -60,7 +60,8 @@
                                   .setDictionaryEncoded(false)
                                   .setDictionaryValuesUnique(false)
                                   .setDictionaryValuesSorted(false)
-                                  .setHasBitmapIndexes(false);
+                                  .setHasBitmapIndexes(false)
+                                  .setHasMultipleValues(false);

Review comment:
       Why this change?
   
   As I understand it, the previous behavior is roughly equivalent to returning unknown for HMV. (Because it would return false, but complete is also false.)

##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       Why `false` here instead of `true`?
   
   My understanding is that the HMV capability in the IncrementalIndex is now either going to be UNKNOWN (where it starts) or TRUE (if there are actually MVs). So UNKNOWN means we haven't seen any MV inputs yet.
   
   But for the reason mentioned in the comment you deleted (the behavior of `StringDimensionIndexer.IndexerDimensionSelector's getRow method`), can't we get empty arrays from dimension selectors when querying IncrementalIndexes, even if the capability is UNKNOWN? So shouldn't we treat an unknown here as `true` rather than `false`?
   
   I must be missing something, since the tests are passing, assuming we have tests for this…

##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java
##########
@@ -31,23 +31,66 @@
  */
 public class ColumnCapabilitiesImpl implements ColumnCapabilities
 {
-  public static ColumnCapabilitiesImpl copyOf(final ColumnCapabilities other)
+  public static ColumnCapabilitiesImpl copyOf(@Nullable final ColumnCapabilities other)
   {
     final ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl();
-    capabilities.merge(other);
-    capabilities.setFilterable(other.isFilterable());
-    capabilities.setIsComplete(other.isComplete());
+    if (other != null) {
+      capabilities.type = other.getType();
+      capabilities.dictionaryEncoded = other.isDictionaryEncoded();
+      capabilities.runLengthEncoded = other.isRunLengthEncoded();
+      capabilities.hasInvertedIndexes = other.hasBitmapIndexes();
+      capabilities.hasSpatialIndexes = other.hasSpatialIndexes();
+      capabilities.hasMultipleValues = other.hasMultipleValues();
+      capabilities.dictionaryValuesSorted = other.areDictionaryValuesSorted();
+      capabilities.dictionaryValuesUnique = other.areDictionaryValuesUnique();
+      capabilities.filterable = other.isFilterable();
+    }
     return capabilities;
   }
 
+  /**
+   * Used at indexing time to finalize all {@link ColumnCapabilities.Capable#UNKNOWN} values to

Review comment:
       Is this ever called with `convertUnknownToTrue == true`? If not, it could be named better.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435587936



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -60,7 +60,8 @@
                                   .setDictionaryEncoded(false)
                                   .setDictionaryValuesUnique(false)
                                   .setDictionaryValuesSorted(false)
-                                  .setHasBitmapIndexes(false);
+                                  .setHasBitmapIndexes(false)
+                                  .setHasMultipleValues(false);

Review comment:
       OK, that makes sense.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435669232



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       I think the code you have is correct, but it's not going to do the optimized path (return `supportsBitmapIndex = true`) for a missing column, even though it would be okay to do it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435677590



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       Ah, I see what you are saying, but in your first comment you meant `Capable.FALSE`, so that [this code path is taken](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/segment/filter/ExpressionFilter.java#L118) to use the empty bitmap. Updated the PR and left a comment in the code to explain what is going on there




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435568767



##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java
##########
@@ -31,23 +31,66 @@
  */
 public class ColumnCapabilitiesImpl implements ColumnCapabilities
 {
-  public static ColumnCapabilitiesImpl copyOf(final ColumnCapabilities other)
+  public static ColumnCapabilitiesImpl copyOf(@Nullable final ColumnCapabilities other)
   {
     final ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl();
-    capabilities.merge(other);
-    capabilities.setFilterable(other.isFilterable());
-    capabilities.setIsComplete(other.isComplete());
+    if (other != null) {
+      capabilities.type = other.getType();
+      capabilities.dictionaryEncoded = other.isDictionaryEncoded();
+      capabilities.runLengthEncoded = other.isRunLengthEncoded();
+      capabilities.hasInvertedIndexes = other.hasBitmapIndexes();
+      capabilities.hasSpatialIndexes = other.hasSpatialIndexes();
+      capabilities.hasMultipleValues = other.hasMultipleValues();
+      capabilities.dictionaryValuesSorted = other.areDictionaryValuesSorted();
+      capabilities.dictionaryValuesUnique = other.areDictionaryValuesUnique();
+      capabilities.filterable = other.isFilterable();
+    }
     return capabilities;
   }
 
+  /**
+   * Used at indexing time to finalize all {@link ColumnCapabilities.Capable#UNKNOWN} values to

Review comment:
       It wasn't when you left this comment, but is now, I renamed the method to `snapshot` and the `Capable` enum method to `coerceUnknownToBoolean` to better relay it's usage




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on pull request #9731:
URL: https://github.com/apache/druid/pull/9731#issuecomment-639294192


   thanks for review @gianm :metal:


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435575043



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -160,11 +160,11 @@ public void close() throws IOException
   public boolean hasMultipleValues(final String dimension)

Review comment:
       IMO, it's ok to return a Capable without it being a standalone class. Like, calling it `ColumnCapabilities.Capable` isn't that weird and is ok I think.
   
   Another alternative is to clarify the behavior of the function through javadocs and, perhaps, changing its name to be more explicit about what it does.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r428255335



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -160,11 +160,11 @@ public void close() throws IOException
   public boolean hasMultipleValues(final String dimension)

Review comment:
       This appears to only be used in `ExpressionFilter` currently, which is one of the few things that has to deal with unknowns, but if we want to do this maybe we should pull the definition of `Capable` outside of the `ColumnCapabilities` class to be standalone.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r428840570



##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       Yes, that's what I was thinking: HMV could be set at any time, including between cap-fetch and selector-creation.
   
   By the way, it looks like StringDimensionIndexer's handling of `hasMultipleValues` is not thread-safe (it's updated by the indexing thread and read by querying threads without locking).

##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       Yes, that's what I was thinking: HMV could be set at any time, including between cap-fetch and selector-creation.
   
   By the way, it looks like StringDimensionIndexer's handling of `hasMultipleValues` is not thread-safe (it's updated by the indexing thread and read by querying threads without locking). It'd be good to fix that too.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r428254895



##########
File path: processing/src/main/java/org/apache/druid/segment/incremental/IncrementalIndexStorageAdapter.java
##########
@@ -142,24 +141,8 @@ public Comparable getMaxValue(String column)
   @Override
   public ColumnCapabilities getColumnCapabilities(String column)
   {
-    // Different from index.getCapabilities because, in a way, IncrementalIndex's string-typed dimensions
-    // are always potentially multi-valued at query time. (Missing / null values for a row can potentially be
-    // represented by an empty array; see StringDimensionIndexer.IndexerDimensionSelector's getRow method.)
-    //
-    // We don't want to represent this as having-multiple-values in index.getCapabilities, because that's used
-    // at index-persisting time to determine if we need a multi-value column or not. However, that means we
-    // need to tweak the capabilities here in the StorageAdapter (a query-time construct), so at query time
-    // they appear multi-valued.
-
-    final ColumnCapabilities capabilitiesFromIndex = index.getCapabilities(column);
-    final IncrementalIndex.DimensionDesc dimensionDesc = index.getDimension(column);
-    if (dimensionDesc != null && dimensionDesc.getCapabilities().getType() == ValueType.STRING) {
-      final ColumnCapabilitiesImpl retVal = ColumnCapabilitiesImpl.copyOf(capabilitiesFromIndex);
-      retVal.setHasMultipleValues(true);
-      return retVal;
-    } else {
-      return capabilitiesFromIndex;
-    }
+    // snapshot the current state
+    return ColumnCapabilitiesImpl.complete(index.getCapabilities(column), false);

Review comment:
       `StringDimensionIndexer.IndexerDimensionSelector` will not have `hasMultipleValues` set until it processes a multi-value row, is there a potential race between the time we look at capabilities that it doesn't yet have a multi-value row, but at execution time the row has appeared and so it will be set and will return an empty array instead of an array with a null?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435606119



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -160,11 +160,11 @@ public void close() throws IOException
   public boolean hasMultipleValues(final String dimension)

Review comment:
       Modified to return `ColumnCapabilities.Capable`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435648576



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       If the column holder is `null` then we should be able to safely return `Capable.TRUE`, because `getBitmapIndex` will return a bitmap index as if the column were full of nulls. This will enable optimizations in ExpressionFilter when the expression is based on a nonexistent column.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435570351



##########
File path: processing/src/main/java/org/apache/druid/segment/column/ColumnCapabilitiesImpl.java
##########
@@ -31,23 +31,66 @@
  */
 public class ColumnCapabilitiesImpl implements ColumnCapabilities
 {
-  public static ColumnCapabilitiesImpl copyOf(final ColumnCapabilities other)
+  public static ColumnCapabilitiesImpl copyOf(@Nullable final ColumnCapabilities other)
   {
     final ColumnCapabilitiesImpl capabilities = new ColumnCapabilitiesImpl();
-    capabilities.merge(other);
-    capabilities.setFilterable(other.isFilterable());
-    capabilities.setIsComplete(other.isComplete());
+    if (other != null) {
+      capabilities.type = other.getType();
+      capabilities.dictionaryEncoded = other.isDictionaryEncoded();
+      capabilities.runLengthEncoded = other.isRunLengthEncoded();
+      capabilities.hasInvertedIndexes = other.hasBitmapIndexes();
+      capabilities.hasSpatialIndexes = other.hasSpatialIndexes();
+      capabilities.hasMultipleValues = other.hasMultipleValues();
+      capabilities.dictionaryValuesSorted = other.areDictionaryValuesSorted();
+      capabilities.dictionaryValuesUnique = other.areDictionaryValuesUnique();
+      capabilities.filterable = other.isFilterable();
+    }
     return capabilities;
   }
 
+  /**
+   * Used at indexing time to finalize all {@link ColumnCapabilities.Capable#UNKNOWN} values to
+   * {@link ColumnCapabilities.Capable#FALSE}, in order to present a snapshot of the state of the this column
+   */
+  @Nullable
+  public static ColumnCapabilitiesImpl complete(
+      @Nullable final ColumnCapabilities capabilities,
+      boolean convertUnknownToTrue
+  )
+  {
+    if (capabilities == null) {
+      return null;
+    }
+    ColumnCapabilitiesImpl copy = copyOf(capabilities);
+    copy.hasMultipleValues = copy.hasMultipleValues.complete(convertUnknownToTrue);
+    copy.dictionaryValuesSorted = copy.dictionaryValuesSorted.complete(convertUnknownToTrue);
+    copy.dictionaryValuesUnique = copy.dictionaryValuesUnique.complete(convertUnknownToTrue);
+    return copy;
+  }
+
+
+  /**
+   * Create a no frills, simple column with {@link ValueType} set and everything else false
+   */
+  public static ColumnCapabilitiesImpl createSimpleNumericColumn(ValueType valueType)

Review comment:
       renamed, except to `createSimpleNumericColumnCapabilities` because i wasn't paying enough attention to the suggestion i guess, but also ok i think :shrug:




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435666847



##########
File path: processing/src/main/java/org/apache/druid/segment/ColumnSelectorBitmapIndexSelector.java
##########
@@ -157,14 +158,16 @@ public void close() throws IOException
   }
 
   @Override
-  public boolean hasMultipleValues(final String dimension)
+  public ColumnCapabilities.Capable hasMultipleValues(final String dimension)
   {
     if (isVirtualColumn(dimension)) {
       return virtualColumns.getVirtualColumn(dimension).capabilities(dimension).hasMultipleValues();
     }
 
     final ColumnHolder columnHolder = index.getColumnHolder(dimension);
-    return columnHolder != null && columnHolder.getCapabilities().hasMultipleValues();
+    return columnHolder != null
+           ? columnHolder.getCapabilities().hasMultipleValues()
+           : ColumnCapabilities.Capable.UNKNOWN;

Review comment:
       Ah, I changed `ExpressionFilter` to check `isMaybeTrue` to handle unknown, so i think unknown is fine here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9731: ColumnCapabilities.hasMultipleValues refactor

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9731:
URL: https://github.com/apache/druid/pull/9731#discussion_r435566970



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandlerUtils.java
##########
@@ -60,7 +60,8 @@
                                   .setDictionaryEncoded(false)
                                   .setDictionaryValuesUnique(false)
                                   .setDictionaryValuesSorted(false)
-                                  .setHasBitmapIndexes(false);
+                                  .setHasBitmapIndexes(false)
+                                  .setHasMultipleValues(false);

Review comment:
       Since null capabilities for vector selectors means unambiguously that the column doesn't exist, I changed this back to leaving multi-value as unknown in favor of checking for this condition in `makeVectorProcessor`, which I think is the best thing to do for now




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org