You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/22 08:36:47 UTC

[GitHub] [druid] kfaraz opened a new pull request #11973: Handle null values in Range Partition dimension distribution

kfaraz opened a new pull request #11973:
URL: https://github.com/apache/druid/pull/11973


   ### Description
   This PR adds support for handling null dimension values while creating partition boundaries
   in range partitioning.
   
   This means that we can now have partition boundaries like `[null, "abc"]` or `["abc", null, "def"]`.
   
   This fixes the following problems seen in production environments.
   - In cases of very sparse data, where there is no row that contains non-null values
     for all partition dimensions, partitioning just doesn't work. This is because all the rows
     are ignored in the `PartialDimensionDistributionTask` and `intervalToPartitions` turns
     out to be empty.
   - If there are too many rows where the first partition dimension is null, all of them would
     end up in the first partition thus bloating up that partition. This can now be mitigated in
     cases where the later dimensions have non-null values which vary over these rows.
   
   #### The tuple of nulls
   Please note that even though `[null, null]` is also a valid boundary now, it will never actually
   be encountered in practice. This is because lexicographically, the null tuple is
   immediately followed the tuple of nulls. i.e. `null < [null, null]`
   Thus, if the start and end of a partition were `null` and `[null, null]` respectively,
   that partition would contain zero rows (end is not inclusive). There would be no
   point in having such a partition and so it never occurs in practice.
   
   ### Changes
    * Do not skip nulls in `PartialDimensionDistributionTask`
    * Add a null-safe serde to use instead of `ArrayOfStringsSerde`
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [x] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] cheddar commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
cheddar commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r755686465



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.ArrayOfItemsSerDe;
+import org.apache.datasketches.ArrayOfStringsSerDe;
+import org.apache.datasketches.Util;
+import org.apache.datasketches.memory.Memory;
+import org.apache.datasketches.memory.WritableMemory;
+import org.apache.druid.data.input.StringTuple;
+
+import java.nio.charset.StandardCharsets;
+
+/**
+ * Serde for {@link StringTuple}.
+ * <p>
+ * The implementation is the same as {@link ArrayOfStringsSerDe}, except this
+ * class handles null String values as well.
+ */
+public class ArrayOfStringsNullSafeSerde extends ArrayOfItemsSerDe<String>
+{
+
+  @Override
+  public byte[] serializeToByteArray(final String[] items)
+  {
+    int length = 0;
+    final byte[][] itemsBytes = new byte[items.length][];
+    for (int i = 0; i < items.length; i++) {
+      // If the String is null, make the byte array also null

Review comment:
       Nit on the code, but instead of ternary operators, I think it would be easier to read with
   
   ```
   // One int for the string length (or -1 for null)
   length += Integer.BYTES;
   if (items[i] != null) {
     itemsBytes[i] = items[i].getBytes(...)
     length += itemsBytes[i].length
   }
   ```

##########
File path: indexing-service/src/test/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerdeTest.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.memory.Memory;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class ArrayOfStringsNullSafeSerdeTest
+{
+
+  private final ArrayOfStringsNullSafeSerde serde = new ArrayOfStringsNullSafeSerde();
+
+  @Test
+  public void testStringArray()
+  {
+    testSerde("abc", "def", "xyz");
+    testSerde("abc", "123", "456.0");
+  }
+
+  @Test
+  public void testSingletonArray()
+  {
+    testSerde("abc");
+    testSerde("xyz");
+  }
+
+  @Test
+  public void testEmptyArray()
+  {
+    testSerde();
+  }
+
+  @Test
+  public void testArrayWithNullString()
+  {
+    testSerde((String) null);
+    testSerde("abc", null, "def");
+    testSerde(null, null, null);
+  }
+
+  @Test
+  public void testArrayWithEmptyString()

Review comment:
       Maybe add a test that has both `null` and empty string in it to show that both can co-exist and will be considered as different things.

##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.ArrayOfItemsSerDe;
+import org.apache.datasketches.ArrayOfStringsSerDe;
+import org.apache.datasketches.Util;
+import org.apache.datasketches.memory.Memory;
+import org.apache.datasketches.memory.WritableMemory;
+import org.apache.druid.data.input.StringTuple;
+
+import java.nio.charset.StandardCharsets;
+
+/**
+ * Serde for {@link StringTuple}.
+ * <p>
+ * The implementation is the same as {@link ArrayOfStringsSerDe}, except this
+ * class handles null String values as well.
+ */
+public class ArrayOfStringsNullSafeSerde extends ArrayOfItemsSerDe<String>
+{
+
+  @Override
+  public byte[] serializeToByteArray(final String[] items)
+  {
+    int length = 0;
+    final byte[][] itemsBytes = new byte[items.length][];
+    for (int i = 0; i < items.length; i++) {
+      // If the String is null, make the byte array also null
+      itemsBytes[i] = items[i] == null ? null : items[i].getBytes(StandardCharsets.UTF_8);
+      length += (itemsBytes[i] == null ? 0 : itemsBytes[i].length) + Integer.BYTES;
+    }
+    final byte[] bytes = new byte[length];
+    final WritableMemory mem = WritableMemory.writableWrap(bytes);
+    long offsetBytes = 0;
+    for (int i = 0; i < items.length; i++) {
+      if (itemsBytes[i] != null) {
+        // Write the length of the array and the array itself
+        mem.putInt(offsetBytes, itemsBytes[i].length);
+        offsetBytes += Integer.BYTES;
+        mem.putByteArray(offsetBytes, itemsBytes[i], 0, itemsBytes[i].length);
+        offsetBytes += itemsBytes[i].length;
+      } else {
+        // If the byte array is null, write the length as -1
+        mem.putInt(offsetBytes, -1);
+        offsetBytes += Integer.BYTES;
+      }
+    }
+    return bytes;
+  }
+
+  @Override
+  public String[] deserializeFromMemory(final Memory mem, final int numItems)
+  {
+    final String[] array = new String[numItems];
+    long offsetBytes = 0;
+    for (int i = 0; i < numItems; i++) {
+      // Read the length of the byte array
+      Util.checkBounds(offsetBytes, Integer.BYTES, mem.getCapacity());
+      final int arrayLength = mem.getInt(offsetBytes);
+      offsetBytes += Integer.BYTES;
+
+      // Negative strLength represents a null byte array and a null String
+      if (arrayLength >= 0) {

Review comment:
       This is maybe aesthetic, but `-1` has special meaning, `-2` does not.  But this code acts as if all negative values have special meaning.  I think it is better to check for `!= -1`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754900489



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
 {
   public static final String TYPE = "partial_dimension_distribution";
 
-  // Future work: StringDistribution does not handle inserting NULLs. This is the same behavior as hadoop indexing.
-  private static final boolean SKIP_NULL = true;
+  // Do not skip nulls as StringDistribution can handle null values.
+  // This behavior is different from hadoop indexing.
+  private static final boolean SKIP_NULL = false;

Review comment:
       I think we should be fine without the flag.
   
   The following effects would be observed on single dim:
   1. Partitioning would now also work on a dimension column that is always null, although it will actually create just one partition.
   2. Estimation of partition boundaries will also take into account null values. So the algorithm would do a better job of estimating the size of the first partition (it would be closer to the target rows). The sizes of later partitions will not be affected (although the same data being ingested before and after this change could have different partition boundaries as the first partition boundary might shift and the others would shift with it)
   
   With the addition of (multi dimension) range partitioning, single dim is inevitably being affected as it now goes through the multi dim flow itself. So this would only be another part of that overall effect.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 merged pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 merged pull request #11973:
URL: https://github.com/apache/druid/pull/11973


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r755728836



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.ArrayOfItemsSerDe;
+import org.apache.datasketches.ArrayOfStringsSerDe;
+import org.apache.datasketches.Util;
+import org.apache.datasketches.memory.Memory;
+import org.apache.datasketches.memory.WritableMemory;
+import org.apache.druid.data.input.StringTuple;
+
+import java.nio.charset.StandardCharsets;
+
+/**
+ * Serde for {@link StringTuple}.
+ * <p>
+ * The implementation is the same as {@link ArrayOfStringsSerDe}, except this
+ * class handles null String values as well.
+ */
+public class ArrayOfStringsNullSafeSerde extends ArrayOfItemsSerDe<String>
+{
+
+  @Override
+  public byte[] serializeToByteArray(final String[] items)
+  {
+    int length = 0;
+    final byte[][] itemsBytes = new byte[items.length][];
+    for (int i = 0; i < items.length; i++) {
+      // If the String is null, make the byte array also null

Review comment:
       Fixed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754830086



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
 {
   public static final String TYPE = "partial_dimension_distribution";
 
-  // Future work: StringDistribution does not handle inserting NULLs. This is the same behavior as hadoop indexing.
-  private static final boolean SKIP_NULL = true;
+  // Do not skip nulls as StringDistribution can handle null values.
+  // This behavior is different from hadoop indexing.
+  private static final boolean SKIP_NULL = false;

Review comment:
       can this be selectively turned on only when more than one dimension is being used? I don't know for certain what the impact of not skipping null will be but then that impact will be limited to new range partitioning only. or it can be based on a flag that you can pass via the context. thoughts? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754900489



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
 {
   public static final String TYPE = "partial_dimension_distribution";
 
-  // Future work: StringDistribution does not handle inserting NULLs. This is the same behavior as hadoop indexing.
-  private static final boolean SKIP_NULL = true;
+  // Do not skip nulls as StringDistribution can handle null values.
+  // This behavior is different from hadoop indexing.
+  private static final boolean SKIP_NULL = false;

Review comment:
       I think we should be fine without the flag.
   
   The following effects would be observed on single dim:
   1. Partitioning would now also work on a dimension column that is always null, although it will actually create just one partition.
   2. Estimation of partition boundaries will also take into account null values. So the algorithm would do a better job of estimating the size of the first (it would be closer to the target rows). The sizes of later partitions will not be affected.
   
   With the addition of (multi dimension) range partitioning, single dim is inevitably being affected as it now goes through the multi dim flow itself. So this would only be another part of that overall effect.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r755728718



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerde.java
##########
@@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.ArrayOfItemsSerDe;
+import org.apache.datasketches.ArrayOfStringsSerDe;
+import org.apache.datasketches.Util;
+import org.apache.datasketches.memory.Memory;
+import org.apache.datasketches.memory.WritableMemory;
+import org.apache.druid.data.input.StringTuple;
+
+import java.nio.charset.StandardCharsets;
+
+/**
+ * Serde for {@link StringTuple}.
+ * <p>
+ * The implementation is the same as {@link ArrayOfStringsSerDe}, except this
+ * class handles null String values as well.
+ */
+public class ArrayOfStringsNullSafeSerde extends ArrayOfItemsSerDe<String>
+{
+
+  @Override
+  public byte[] serializeToByteArray(final String[] items)
+  {
+    int length = 0;
+    final byte[][] itemsBytes = new byte[items.length][];
+    for (int i = 0; i < items.length; i++) {
+      // If the String is null, make the byte array also null
+      itemsBytes[i] = items[i] == null ? null : items[i].getBytes(StandardCharsets.UTF_8);
+      length += (itemsBytes[i] == null ? 0 : itemsBytes[i].length) + Integer.BYTES;
+    }
+    final byte[] bytes = new byte[length];
+    final WritableMemory mem = WritableMemory.writableWrap(bytes);
+    long offsetBytes = 0;
+    for (int i = 0; i < items.length; i++) {
+      if (itemsBytes[i] != null) {
+        // Write the length of the array and the array itself
+        mem.putInt(offsetBytes, itemsBytes[i].length);
+        offsetBytes += Integer.BYTES;
+        mem.putByteArray(offsetBytes, itemsBytes[i], 0, itemsBytes[i].length);
+        offsetBytes += itemsBytes[i].length;
+      } else {
+        // If the byte array is null, write the length as -1
+        mem.putInt(offsetBytes, -1);
+        offsetBytes += Integer.BYTES;
+      }
+    }
+    return bytes;
+  }
+
+  @Override
+  public String[] deserializeFromMemory(final Memory mem, final int numItems)
+  {
+    final String[] array = new String[numItems];
+    long offsetBytes = 0;
+    for (int i = 0; i < numItems; i++) {
+      // Read the length of the byte array
+      Util.checkBounds(offsetBytes, Integer.BYTES, mem.getCapacity());
+      final int arrayLength = mem.getInt(offsetBytes);
+      offsetBytes += Integer.BYTES;
+
+      // Negative strLength represents a null byte array and a null String
+      if (arrayLength >= 0) {

Review comment:
       Fixed.

##########
File path: indexing-service/src/test/java/org/apache/druid/indexing/common/task/batch/parallel/distribution/ArrayOfStringsNullSafeSerdeTest.java
##########
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.indexing.common.task.batch.parallel.distribution;
+
+import org.apache.datasketches.memory.Memory;
+import org.junit.Assert;
+import org.junit.Test;
+
+public class ArrayOfStringsNullSafeSerdeTest
+{
+
+  private final ArrayOfStringsNullSafeSerde serde = new ArrayOfStringsNullSafeSerde();
+
+  @Test
+  public void testStringArray()
+  {
+    testSerde("abc", "def", "xyz");
+    testSerde("abc", "123", "456.0");
+  }
+
+  @Test
+  public void testSingletonArray()
+  {
+    testSerde("abc");
+    testSerde("xyz");
+  }
+
+  @Test
+  public void testEmptyArray()
+  {
+    testSerde();
+  }
+
+  @Test
+  public void testArrayWithNullString()
+  {
+    testSerde((String) null);
+    testSerde("abc", null, "def");
+    testSerde(null, null, null);
+  }
+
+  @Test
+  public void testArrayWithEmptyString()

Review comment:
       Added.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #11973:
URL: https://github.com/apache/druid/pull/11973#issuecomment-977674228


   Thank you @kfaraz. I have merged this change. I just realized that this new behaviour should be called out in the docs as well. will that be done in a follow-up PR? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on pull request #11973:
URL: https://github.com/apache/druid/pull/11973#issuecomment-977691731


   Thanks, @abhishekagarwal87 . The doc changes for multi-dim partitioning are being made in #11983 . I will create a follow up PR that documents the change in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] kfaraz commented on a change in pull request #11973: Handle null values in Range Partition dimension distribution

Posted by GitBox <gi...@apache.org>.
kfaraz commented on a change in pull request #11973:
URL: https://github.com/apache/druid/pull/11973#discussion_r754900489



##########
File path: indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/PartialDimensionDistributionTask.java
##########
@@ -68,8 +68,9 @@
 {
   public static final String TYPE = "partial_dimension_distribution";
 
-  // Future work: StringDistribution does not handle inserting NULLs. This is the same behavior as hadoop indexing.
-  private static final boolean SKIP_NULL = true;
+  // Do not skip nulls as StringDistribution can handle null values.
+  // This behavior is different from hadoop indexing.
+  private static final boolean SKIP_NULL = false;

Review comment:
       I think we should be fine without the flag.
   
   The following effects would be observed on single dim:
   1. Partitioning would now also work on a dimension column that is always null, although it will actually create just one partition.
   2. Estimation of partition boundaries will also take into account null values. So the algorithm would do a better job of estimating the size of the first partition (it would be closer to the target rows). The sizes of later partitions will not be affected.
   
   With the addition of (multi dimension) range partitioning, single dim is inevitably being affected as it now goes through the multi dim flow itself. So this would only be another part of that overall effect.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org