You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/26 11:07:40 UTC
[GitHub] [druid] clintropolis commented on a change in pull request #10304: Add vectorization for druid-histogram extension

clintropolis commented on a change in pull request #10304:
URL: https://github.com/apache/druid/pull/10304#discussion_r477198847



##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramBufferAggregator.java
##########
@@ -28,54 +28,30 @@
 public class ApproximateHistogramBufferAggregator implements BufferAggregator
 {
   private final BaseFloatColumnValueSelector selector;
-  private final int resolution;
+  private final ApproximateHistogramBufferAggregatorInternal innerAggregator;

Review comment:
       Did you consider making the shared functionality just be static methods to be more consistent with how `HyperUniquesBufferAggregator` and `HyperUniquesVectorAggregator` are implemented? This is totally nitpicking, but something just seems off about these things having a thing called 'innerAggregator' that doesn't implement any of the aggregator interfaces.

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogram.java
##########
@@ -431,6 +433,33 @@ public void incrementMissing()
     }
   }
 
+  /**
+   * Merge another datapoint into this one. The other datapoin could be

Review comment:
       typo: 'datapoin' -> 'datapoint'

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramAggregatorFactory.java
##########
@@ -99,6 +105,34 @@ public BufferAggregator factorizeBuffered(ColumnSelectorFactory metricFactory)
     );
   }
 
+  @Override
+  public VectorAggregator factorizeVector(VectorColumnSelectorFactory columnSelectorFactory)
+  {
+    ColumnCapabilities capabilities = columnSelectorFactory.getColumnCapabilities(fieldName);
+    if (null == capabilities) {

Review comment:
       I think this isn't possible since `canVectorize` checks that capabilities isn't null

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramAggregatorFactory.java
##########
@@ -99,6 +105,34 @@ public BufferAggregator factorizeBuffered(ColumnSelectorFactory metricFactory)
     );
   }
 
+  @Override
+  public VectorAggregator factorizeVector(VectorColumnSelectorFactory columnSelectorFactory)
+  {
+    ColumnCapabilities capabilities = columnSelectorFactory.getColumnCapabilities(fieldName);
+    if (null == capabilities) {
+      throw new IAE("could not find the column type for column %s", fieldName);
+    }
+    ValueType type = capabilities.getType();
+    if (type.isNumeric()) {
+      return new FixedBucketsHistogramVectorAggregator(
+          columnSelectorFactory.makeValueSelector(fieldName),
+          lowerLimit,
+          upperLimit,
+          numBuckets,
+          outlierHandlingMode
+      );
+    } else {
+      throw new IAE("cannot vectorize fixed bucket histogram aggregation for type %s", type);
+    }
+  }
+
+  @Override
+  public boolean canVectorize(ColumnInspector columnInspector)
+  {
+    ColumnCapabilities capabilities = columnInspector.getColumnCapabilities(fieldName);
+    return (capabilities != null) && capabilities.getType().isNumeric();

Review comment:
       Did you mean to only handle numeric primitive inputs? The input type could also be complex if you handle fixed bucket histogram inputs, but you would need another vector aggregator implementation I think that takes an object selector instead of value selector

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregator.java
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.histogram;
+
+import org.apache.druid.query.aggregation.VectorAggregator;
+import org.apache.druid.segment.vector.VectorValueSelector;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+public class ApproximateHistogramVectorAggregator implements VectorAggregator
+{
+
+  private final VectorValueSelector selector;
+  private final ApproximateHistogramBufferAggregatorInternal innerAggregator;
+
+  public ApproximateHistogramVectorAggregator(
+      VectorValueSelector selector,
+      int resolution
+  )
+  {
+    this.selector = selector;
+    this.innerAggregator = new ApproximateHistogramBufferAggregatorInternal(resolution);
+  }
+
+  @Override
+  public void init(final ByteBuffer buf, final int position)
+  {
+    innerAggregator.init(buf, position);
+  }
+
+  @Override
+  public void aggregate(final ByteBuffer buf, final int position, final int startRow, final int endRow)
+  {
+    final boolean[] isValueNull = selector.getNullVector();
+    final float[] vector = selector.getFloatVector();
+    ApproximateHistogram histogram = innerAggregator.get(buf, position);
+
+    for (int i = startRow; i < endRow; i++) {
+      if (isValueNull != null && isValueNull[i]) {

Review comment:
       you can also ignore null checks entirely if `NullHandling.sqlCompatible()` is `true`, would suggest saving it as a private final field in the constructor and then maybe add something like `final boolean checkNulls = hasNulls && isValueNull != null`

##########
File path: extensions-core/histogram/src/test/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramVectorAggregatorTest.java
##########
@@ -0,0 +1,136 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.histogram;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.aggregation.VectorAggregator;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnCapabilitiesImpl;
+import org.apache.druid.segment.column.ValueType;
+import org.apache.druid.segment.vector.VectorColumnSelectorFactory;
+import org.apache.druid.segment.vector.VectorValueSelector;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.ByteBuffer;
+
+import static org.easymock.EasyMock.createMock;
+import static org.easymock.EasyMock.expect;
+
+public class ApproximateHistogramVectorAggregatorTest
+{
+  private static final float[] FLOATS = {23, 19, 10, 16, 36, 2, 9, 32, 30, 45, 33};   // Last value is never included
+  private static final boolean[] NULL_VECTOR =
+      {false, false, false, false, false, false, false, false, false, false, true};
+  private VectorColumnSelectorFactory vectorColumnSelectorFactory;
+
+  @Before
+  public void setup()
+  {
+    NullHandling.initializeForTests();
+    VectorValueSelector vectorValueSelector_1 = createMock(VectorValueSelector.class);
+    expect(vectorValueSelector_1.getFloatVector()).andReturn(FLOATS).anyTimes();
+    expect(vectorValueSelector_1.getNullVector()).andReturn(NULL_VECTOR).anyTimes();
+
+    VectorValueSelector vectorValueSelector_2 = createMock(VectorValueSelector.class);
+    expect(vectorValueSelector_2.getFloatVector()).andReturn(FLOATS).anyTimes();
+    expect(vectorValueSelector_2.getNullVector()).andReturn(null).anyTimes();
+
+    EasyMock.replay(vectorValueSelector_1);
+    EasyMock.replay(vectorValueSelector_2);
+
+    ColumnCapabilities columnCapabilities
+        = new ColumnCapabilitiesImpl().setType(ValueType.DOUBLE).setDictionaryEncoded(true);

Review comment:
       nit: suggest `ColumnCapabilitiesImpl.createSimpleNumericColumnCapabilities(ValueType.DOUBLE)` since it will create realistic double capabilities (numbers are not dictionary encoded for example)

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogram.java
##########
@@ -431,6 +433,33 @@ public void incrementMissing()
     }
   }
 
+  /**
+   * Merge another datapoint into this one. The other datapoin could be
+   *  - base64 encoded string of {@code FixedBucketsHistogram}
+   *  - {@code FixedBucketsHistogram} object
+   *  - Numeric value
+   *
+   * @param val
+   */
+  void combine(@Nullable Object val)

Review comment:
       I'm not sure this function should be shared between the vectorized and non-vectorized aggregator. For the vector aggregator the `if` should probably be outside of the `for` loop i think, because the contents of the vector will be consistent throughout the loop.
   
   Also, I think you might need different selectors depending on if the inputs to the aggregator are numeric primitives (value selector to get double vector and null boolean vector), or if the input is other fixed bucket histogram sketches (object selector to get array of histogram objects). The fixed bucket histogram aggregator is a combined primitive and sketch merging aggregator, unlike the approximate histogram aggregators which are split and handles the sketch inputs and result merges with the 'fold' aggregators.

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramFoldingBufferAggregatorInternal.java
##########
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.histogram;
+
+import javax.annotation.Nullable;
+import java.nio.ByteBuffer;
+
+/**
+ * A helper class used by {@link ApproximateHistogramFoldingBufferAggregator} and
+ * {@link ApproximateHistogramFoldingVectorAggregator} for aggregation operations on byte buffers.
+ * Getting the object from value selectors is outside this class.
+ */
+final class ApproximateHistogramFoldingBufferAggregatorInternal
+{
+  private final int resolution;
+  private final float upperLimit;
+  private final float lowerLimit;
+
+  private float[] tmpBufferA;
+  private long[] tmpBufferB;
+
+  public ApproximateHistogramFoldingBufferAggregatorInternal(
+      int resolution,
+      float lowerLimit,
+      float upperLimit
+  )
+  {
+    this.resolution = resolution;
+    this.lowerLimit = lowerLimit;
+    this.upperLimit = upperLimit;
+
+    tmpBufferA = new float[resolution];
+    tmpBufferB = new long[resolution];
+  }
+
+  public void init(ByteBuffer buf, int position)
+  {
+    ApproximateHistogram h = new ApproximateHistogram(resolution, lowerLimit, upperLimit);
+
+    ByteBuffer mutationBuffer = buf.duplicate();
+    mutationBuffer.position(position);
+    // use dense storage for aggregation
+    h.toBytesDense(mutationBuffer);
+  }
+
+  public void aggregate(ByteBuffer buf, int position, @Nullable ApproximateHistogram hNext)
+  {
+    if (hNext == null) {
+      return;
+    }
+    ByteBuffer mutationBuffer = buf.duplicate();
+    mutationBuffer.position(position);
+
+    ApproximateHistogram h0 = ApproximateHistogram.fromBytesDense(mutationBuffer);
+    foldFast(h0, hNext);
+
+    mutationBuffer.position(position);
+    h0.toBytesDense(mutationBuffer);
+  }
+
+  public void foldFast(ApproximateHistogram left, ApproximateHistogram right)
+  {
+    //TODO: do these have to set in every call
+    left.setLowerLimit(lowerLimit);

Review comment:
       A quick look and I think I agree, but am not totally certain. Can you try to find out if it is needed so we can remove this TODO and either remove the code, or add a comment on why it needs to be here?

##########
File path: extensions-core/histogram/src/test/java/org/apache/druid/query/aggregation/histogram/FixedBucketsHistogramVectorAggregatorTest.java
##########
@@ -0,0 +1,209 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.druid.query.aggregation.histogram;
+
+import org.apache.druid.common.config.NullHandling;
+import org.apache.druid.query.aggregation.VectorAggregator;
+import org.apache.druid.segment.column.ColumnCapabilities;
+import org.apache.druid.segment.column.ColumnCapabilitiesImpl;
+import org.apache.druid.segment.column.ValueType;
+import org.apache.druid.segment.vector.VectorColumnSelectorFactory;
+import org.apache.druid.segment.vector.VectorValueSelector;
+import org.easymock.EasyMock;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.nio.ByteBuffer;
+
+import static org.easymock.EasyMock.createMock;
+import static org.easymock.EasyMock.expect;
+
+public class FixedBucketsHistogramVectorAggregatorTest
+{
+  private static final double[] DOUBLES = {1.0, 12.0, 3.0, 14.0, 15.0, 16.0};
+  private static final boolean[] NULL_VECTOR = {false, false, false, false, true, false};
+  private VectorColumnSelectorFactory vectorColumnSelectorFactory;
+
+  @Before
+  public void setup()
+  {
+    NullHandling.initializeForTests();
+    VectorValueSelector vectorValueSelector_1 = createMock(VectorValueSelector.class);
+    expect(vectorValueSelector_1.getDoubleVector()).andReturn(DOUBLES).anyTimes();
+    expect(vectorValueSelector_1.getNullVector()).andReturn(NULL_VECTOR).anyTimes();
+
+    VectorValueSelector vectorValueSelector_2 = createMock(VectorValueSelector.class);
+    expect(vectorValueSelector_2.getDoubleVector()).andReturn(DOUBLES).anyTimes();
+    expect(vectorValueSelector_2.getNullVector()).andReturn(null).anyTimes();
+
+    EasyMock.replay(vectorValueSelector_1);
+    EasyMock.replay(vectorValueSelector_2);
+
+    ColumnCapabilities columnCapabilities
+        = new ColumnCapabilitiesImpl().setType(ValueType.DOUBLE).setDictionaryEncoded(true);

Review comment:
       nit: same comment about capabilities

##########
File path: extensions-core/histogram/src/main/java/org/apache/druid/query/aggregation/histogram/ApproximateHistogramAggregatorFactory.java
##########
@@ -102,6 +105,21 @@ public BufferAggregator factorizeBuffered(ColumnSelectorFactory metricFactory)
     );
   }
 
+  @Override
+  public VectorAggregator factorizeVector(VectorColumnSelectorFactory metricVectorFactory)
+  {
+    return new ApproximateHistogramVectorAggregator(
+        metricVectorFactory.makeValueSelector(fieldName),
+        resolution
+    );
+  }
+
+  @Override
+  public boolean canVectorize(ColumnInspector columnInspector)
+  {
+    return true;

Review comment:
       should this check if the column is numeric or complex similar to the fixed buckets aggregator factory? I don't think we have a good way for aggregators to handle string inputs in vectorized engine yet either, unless you use `SingleValueDimensionVectorSelector` or `MultiValueDimensionVectorSelector` and lookup the string values for the int arrays yourself, so should probably exclude strings at least (not that they make much sense as an input anyway).




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org