You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/04/22 04:00:43 UTC

[GitHub] [druid] jon-wei opened a new pull request #9742: Adjust string dim comparator used for segment merging

jon-wei opened a new pull request #9742:
URL: https://github.com/apache/druid/pull/9742


   This PR fixes a bug where rollup fails to happen during segment merging when multivalue string dimensions are used.
   
   The included unit test shows how to trigger the bug:
   
   In one segment, I have the following two rows:
   ```
   {dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}
   {dimA=leek, dimMultiVal=[1, 2, 3, 5]} | {sumCount=1}
   ```
   In a second segment, I have the following two rows:
   ```
   {dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}
   {dimA=potato, dimMultiVal=[0, 1, 4]} | {sumCount=1}
   ```
   
   Prior to this patch, merging these two segments with rollup enabled results in a segment with 4 rows:
   ```
   {dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}
   {dimA=leek, dimMultiVal=[1, 2, 3, 5]} | {sumCount=1}
   {dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}
   {dimA=potato, dimMultiVal=[0, 1, 4]} | {sumCount=1}
   ```
   
   This bug occurs because the comparator used by `TimeAndDimsPointer` during segment merging (from `DimensionHandler::getEncodedValueSelectorComparator` which is `StringDimensionHandler.DIMENSION_SELECTOR_COMPARATOR` for String dimensions) has different logic from the comparator used in `OnHeapIncrementaIndex`. The latter uses `StringDimensionIndexer.compareUnsortedEncodedKeyComponents`.
   
   As a result, using the example rows above, the priority queue in `MergingRowIterator` would first return `{dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}` from the first segment. Afterwards, the first segment's iterator would present the row `{dimA=leek, dimMultiVal=[1, 2, 3, 5]} | {sumCount=1}`, and the second segment's iterator would present the row `{dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}`.
   
   `StringDimensionHandler.DIMENSION_SELECTOR_COMPARATOR` before this PR considers ``{dimA=leek, dimMultiVal=[1, 2, 3, 5]}` to precede `dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}`, so rollup fails to happen for the `dimA=leek, dimMultiVal=[1, 2, 4]} | {sumCount=1}` row.
   
   This PR takes the approach of adjusting `StringDimensionHandler.DIMENSION_SELECTOR_COMPARATOR` so that follows how `StringDimensionIndexer.compareUnsortedEncodedKeyComponents` behaves.
   
   This PR has:
   - [x] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #9742: Adjust string comparators used for ingestion

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #9742:
URL: https://github.com/apache/druid/pull/9742#discussion_r414845626



##########
File path: processing/src/main/java/org/apache/druid/segment/StringDimensionHandler.java
##########
@@ -33,61 +33,44 @@
 
 public class StringDimensionHandler implements DimensionHandler<Integer, int[], String>
 {
-
   /**
-   * Compares {@link IndexedInts} lexicographically, with the exception that if a row contains only zeros (that's the
-   * index of null) at all positions, it is considered "null" as a whole and is "less" than any "non-null" row. Empty
-   * row (size is zero) is also considered "null".
-   *
-   * The implementation is a bit complicated because it tries to check each position of both rows only once.
+   * This comparator uses the following rules:
+   * - If the value arrays have different lengths, the shorter value array is considered smaller
+   *   - The single exception to this is null and the empty list, which are considered equal
+   * - If the value arrays are the same length, compare value by value until a difference is reached
    */
   private static final Comparator<ColumnValueSelector> DIMENSION_SELECTOR_COMPARATOR = (s1, s2) -> {
     IndexedInts row1 = getRow(s1);
     IndexedInts row2 = getRow(s2);
     int len1 = row1.size();
     int len2 = row2.size();
-    boolean row1IsNull = true;
-    boolean row2IsNull = true;
-    for (int i = 0; i < Math.min(len1, len2); i++) {
-      int v1 = row1.get(i);
-      row1IsNull &= v1 == 0;
-      int v2 = row2.get(i);
-      row2IsNull &= v2 == 0;
-      int valueDiff = Integer.compare(v1, v2);
-      if (valueDiff != 0) {
-        return valueDiff;
-      }
-    }
-    //noinspection SubtractionInCompareTo -- substraction is safe here, because lengths or rows are small numbers.
-    int lenDiff = len1 - len2;
-    if (lenDiff == 0) {
-      return 0;
-    } else {
-      if (!row1IsNull || !row2IsNull) {
-        return lenDiff;
-      } else {
-        return compareRestNulls(row1, len1, row2, len2);
+    int retVal = Integer.compare(len1, len2);
+    int valsIndex = 0;
+
+    if (retVal != 0) {

Review comment:
       Hmm, this sort seems to behave quite a bit differently than the intended function of the previous sort; this one ordering by row length then lexicographically element by element, while the previous ordering lexicographically element by element then by length.
   
   ```
   ["a", "b"],
   ["a", "c"],
   ["d", "e"],
   ["a", "b", "c"]
   ```
    compared to  
   ```
   ["a", "b"],
   ["a", "b", "c"],
   ["a", "c"],
   ["d", "e"]
   ```
   
   It might be more prudent to adjust the other sort to be more like this one is intended to be, though I imagine in practice that most of the time this doesn't make much difference except when all the times are the same (so less query granularity) and this is the prominent ordering column, and even then might not make much difference in practice, so don't make this change probably if it is a major hassle and instead we just put in release notes that there is a behavior change.

##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandler.java
##########
@@ -122,6 +122,10 @@ DimensionMergerV9 makeMerger(
    * Returns a comparator that knows how to compare {@link ColumnValueSelector} of the assumed dimension type,
    * corresponding to this DimensionHandler. E. g. {@link StringDimensionHandler} returns a comparator, that compares
    * {@link ColumnValueSelector}s as {@link DimensionSelector}s.
+   *
+   * The comparison rules used by this method should match the rules used by

Review comment:
       Thanks for adding this comment :+1:, would it also be worth adding the implications of this, e.g. it can result in imperfect rollup?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on a change in pull request #9742: Adjust string comparators used for ingestion

Posted by GitBox <gi...@apache.org>.
jon-wei commented on a change in pull request #9742:
URL: https://github.com/apache/druid/pull/9742#discussion_r414874713



##########
File path: processing/src/main/java/org/apache/druid/segment/DimensionHandler.java
##########
@@ -122,6 +122,10 @@ DimensionMergerV9 makeMerger(
    * Returns a comparator that knows how to compare {@link ColumnValueSelector} of the assumed dimension type,
    * corresponding to this DimensionHandler. E. g. {@link StringDimensionHandler} returns a comparator, that compares
    * {@link ColumnValueSelector}s as {@link DimensionSelector}s.
+   *
+   * The comparison rules used by this method should match the rules used by

Review comment:
       I added a note about imperfect rollup to these javadocs




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on a change in pull request #9742: Adjust string comparators used for ingestion

Posted by GitBox <gi...@apache.org>.
jon-wei commented on a change in pull request #9742:
URL: https://github.com/apache/druid/pull/9742#discussion_r414875330



##########
File path: processing/src/main/java/org/apache/druid/segment/StringDimensionHandler.java
##########
@@ -33,61 +33,44 @@
 
 public class StringDimensionHandler implements DimensionHandler<Integer, int[], String>
 {
-
   /**
-   * Compares {@link IndexedInts} lexicographically, with the exception that if a row contains only zeros (that's the
-   * index of null) at all positions, it is considered "null" as a whole and is "less" than any "non-null" row. Empty
-   * row (size is zero) is also considered "null".
-   *
-   * The implementation is a bit complicated because it tries to check each position of both rows only once.
+   * This comparator uses the following rules:
+   * - If the value arrays have different lengths, the shorter value array is considered smaller
+   *   - The single exception to this is null and the empty list, which are considered equal
+   * - If the value arrays are the same length, compare value by value until a difference is reached
    */
   private static final Comparator<ColumnValueSelector> DIMENSION_SELECTOR_COMPARATOR = (s1, s2) -> {
     IndexedInts row1 = getRow(s1);
     IndexedInts row2 = getRow(s2);
     int len1 = row1.size();
     int len2 = row2.size();
-    boolean row1IsNull = true;
-    boolean row2IsNull = true;
-    for (int i = 0; i < Math.min(len1, len2); i++) {
-      int v1 = row1.get(i);
-      row1IsNull &= v1 == 0;
-      int v2 = row2.get(i);
-      row2IsNull &= v2 == 0;
-      int valueDiff = Integer.compare(v1, v2);
-      if (valueDiff != 0) {
-        return valueDiff;
-      }
-    }
-    //noinspection SubtractionInCompareTo -- substraction is safe here, because lengths or rows are small numbers.
-    int lenDiff = len1 - len2;
-    if (lenDiff == 0) {
-      return 0;
-    } else {
-      if (!row1IsNull || !row2IsNull) {
-        return lenDiff;
-      } else {
-        return compareRestNulls(row1, len1, row2, len2);
+    int retVal = Integer.compare(len1, len2);
+    int valsIndex = 0;
+
+    if (retVal != 0) {

Review comment:
       Good point, I've adjusted both comparators to follow the rules described here: https://github.com/apache/druid/pull/9742/files#diff-4c87c650e139d3889184df880b97ff4aR37




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9742: Adjust string comparators used for ingestion

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9742:
URL: https://github.com/apache/druid/pull/9742#discussion_r416029670



##########
File path: processing/src/main/java/org/apache/druid/segment/StringDimensionIndexer.java
##########
@@ -399,23 +399,42 @@ public int compareUnsortedEncodedKeyComponents(int[] lhs, int[] rhs)
     int lhsLen = lhs.length;
     int rhsLen = rhs.length;
 
-    int retVal = Ints.compare(lhsLen, rhsLen);
+    int lenCompareResult = Ints.compare(lhsLen, rhsLen);
+    if (lenCompareResult != 0) {
+      // if the values don't have the same length, check if we're comparing [] and [null], which are equivalent

Review comment:
       @jon-wei Should [1] and [1, null] not be equivalent?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org