You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/01/15 06:57:56 UTC

[GitHub] [druid] gianm opened a new pull request #10767: Vectorized theta sketch aggregator.

gianm opened a new pull request #10767:
URL: https://github.com/apache/druid/pull/10767


   Also a refactoring of BufferAggregator and VectorAggregator such that
   they share a common interface, BaseBufferAggregator. This allows
   implementing both in the same file with an abstract + dual subclass
   structure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10767: Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-763818055


   Pushed up a new patch.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #10767: Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory.

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #10767:
URL: https://github.com/apache/druid/pull/10767#discussion_r560893111



##########
File path: processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java
##########
@@ -22,34 +22,59 @@
 import org.apache.druid.segment.column.ColumnCapabilities;
 import org.apache.druid.segment.vector.MultiValueDimensionVectorSelector;
 import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+import org.apache.druid.segment.vector.VectorObjectSelector;
 import org.apache.druid.segment.vector.VectorValueSelector;
 
 /**
  * Class that encapsulates knowledge about how to create vector column processors. Used by
- * {@link DimensionHandlerUtils#makeVectorProcessor}.
+ * {@link ColumnProcessors#makeVectorProcessor}.
  *
- * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method. The default type is
- * always implicitly STRING. It also does not have a "makeComplexProcessor" method; instead, complex-typed columns
- * are fed into "makeSingleValueDimensionProcessor". This behavior may change in the future to better align
- * with {@link ColumnProcessorFactory}.
+ * Column processors can be any type "T". The idea is that a ColumnProcessorFactory embodies the logic for wrapping
+ * and processing selectors of various types, and so enables nice code design, where type-dependent code is not
+ * sprinkled throughout.
+ *
+ * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method, because vector
+ * column types are always known, so it isn't necessary.
  *
  * @see ColumnProcessorFactory the non-vectorized version
  */
 public interface VectorColumnProcessorFactory<T>
 {
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is STRING and the underlying column always has a single value
+   * per row.
+   */
   T makeSingleValueDimensionProcessor(
       ColumnCapabilities capabilities,
       SingleValueDimensionVectorSelector selector
   );
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is STRING and the underlying column may have multiple values
+   * per row.
+   */
   T makeMultiValueDimensionProcessor(
       ColumnCapabilities capabilities,
       MultiValueDimensionVectorSelector selector
   );
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is FLOAT.
+   */
   T makeFloatProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is DOUBLE.
+   */
   T makeDoubleProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is LONG.
+   */
   T makeLongProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
+
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is COMPLEX.
+   */
+  T makeComplexProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector);

Review comment:
       Since this isn't actually specific to complex columns, can this be called `makeObjectProcessor`? I have the same method signature except with that name added to this class in #10613, there to support using object selectors for string expression virtual columns which are not dictionary encoded.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-760774449


   Right. I was being subtle about it :) It would be nice to use the same style everywhere. If that pattern needs some modification, I can do that as well. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-760776516


   Which pattern do you think is better? I haven't had a chance to take a look at yours yet, so I'm wondering if I should make yours match mine, or mine match yours.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm merged pull request #10767: Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory.

Posted by GitBox <gi...@apache.org>.
gianm merged pull request #10767:
URL: https://github.com/apache/druid/pull/10767


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-760844449


   I usually avoid abstract classes so there is no tight coupling. It can make unit testing easier as well. I can pass a mock helper into a vector aggregator and unit test in different ways easily. so my preference is toward the way ApproximateHistogram is implemented. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-761138114


   I've pushed up a new patch that undoes the refactoring and instead implements things using the same technique as #10304.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-760715147


   By the way, I did a similar thing in https://github.com/apache/druid/pull/10304 to reduce some boilerplate code but used composition instead of inheritance. Added package-private`*BufferAggregatorHelper` classes which in turn are used by vector and buffer aggregator. The helper classes know how to work with buffers. The logic for getting an object out of selector and vectorization stuff was left out for Buffer and Vector classes. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-760769950


   > By the way, I did a similar thing in #10304 to reduce some boilerplate code but used composition instead of inheritance. Added package-private`*BufferAggregatorHelper` classes which in turn are used by vector and buffer aggregator. The helper classes know how to work with buffers. The logic for getting an object out of selector and vectorization stuff was left out for Buffer and Vector classes.
   
   Ah, thanks for pointing that out. I haven't had a chance to look at your patch yet. Do you think I should rework this patch to do the same thing you did? I suppose we don't want to have too many ways of doing the same thing in the codebase.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10767: Vectorized theta sketch aggregator + rework of VectorColumnProcessorFactory.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10767:
URL: https://github.com/apache/druid/pull/10767#discussion_r561153188



##########
File path: processing/src/main/java/org/apache/druid/segment/VectorColumnProcessorFactory.java
##########
@@ -22,34 +22,59 @@
 import org.apache.druid.segment.column.ColumnCapabilities;
 import org.apache.druid.segment.vector.MultiValueDimensionVectorSelector;
 import org.apache.druid.segment.vector.SingleValueDimensionVectorSelector;
+import org.apache.druid.segment.vector.VectorObjectSelector;
 import org.apache.druid.segment.vector.VectorValueSelector;
 
 /**
  * Class that encapsulates knowledge about how to create vector column processors. Used by
- * {@link DimensionHandlerUtils#makeVectorProcessor}.
+ * {@link ColumnProcessors#makeVectorProcessor}.
  *
- * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method. The default type is
- * always implicitly STRING. It also does not have a "makeComplexProcessor" method; instead, complex-typed columns
- * are fed into "makeSingleValueDimensionProcessor". This behavior may change in the future to better align
- * with {@link ColumnProcessorFactory}.
+ * Column processors can be any type "T". The idea is that a ColumnProcessorFactory embodies the logic for wrapping
+ * and processing selectors of various types, and so enables nice code design, where type-dependent code is not
+ * sprinkled throughout.
+ *
+ * Unlike {@link ColumnProcessorFactory}, this interface does not have a "defaultType" method, because vector
+ * column types are always known, so it isn't necessary.
  *
  * @see ColumnProcessorFactory the non-vectorized version
  */
 public interface VectorColumnProcessorFactory<T>
 {
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is STRING and the underlying column always has a single value
+   * per row.
+   */
   T makeSingleValueDimensionProcessor(
       ColumnCapabilities capabilities,
       SingleValueDimensionVectorSelector selector
   );
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is STRING and the underlying column may have multiple values
+   * per row.
+   */
   T makeMultiValueDimensionProcessor(
       ColumnCapabilities capabilities,
       MultiValueDimensionVectorSelector selector
   );
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is FLOAT.
+   */
   T makeFloatProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is DOUBLE.
+   */
   T makeDoubleProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
 
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is LONG.
+   */
   T makeLongProcessor(ColumnCapabilities capabilities, VectorValueSelector selector);
+
+  /**
+   * Called when {@link ColumnCapabilities#getType()} is COMPLEX.
+   */
+  T makeComplexProcessor(ColumnCapabilities capabilities, VectorObjectSelector selector);

Review comment:
       Sure, although I'll keep the javadoc the same for now, since as of this patch it's still only called when the type is COMPLEX. It seems that one of us will need to rework implementations a bit based on which patch gets merged first.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10767: Vectorized theta sketch aggregator.

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10767:
URL: https://github.com/apache/druid/pull/10767#issuecomment-763398344


   When debugging a test failure, I realized that some additional work is needed to make the aggregator able to handle both complex inputs (existing theta sketches) and regular inputs (strings, primitives). There's some debt that needed to be paid down:
   
   - VectorColumnProcessorFactory had to be taught about complex columns, instead of handling them all as all-null string columns
   - Existing users of VectorColumnProcessorFactory had to be taught about complex columns too
   - To support the above, makeVectorProcessor needed to be moved from DimensionHandlerUtils to ColumnProcessors and harmonized with the ColumnProcessor stuff


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org