You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "vvivekiyer (via GitHub)" <gi...@apache.org> on 2023/02/22 18:02:16 UTC

[GitHub] [pinot] vvivekiyer opened a new issue, #10318: [multistage] Return type for AVG() aggregation function

vvivekiyer opened a new issue, #10318:
URL: https://github.com/apache/pinot/issues/10318

   The AVG() aggregation function in Multistage engine relied to Calcite's default logic of setting the return type to be same as the input column. For example, if the input column is INT, the return type for AVG() is INT
   
   In our v1 engine, the return type for AVG aggregation function is always set to double. To match behavior, we created PR https://github.com/apache/pinot/pull/10314 to override calcite's `deriveAvgAggType` to always cast the return type to DOUBLE. 
   
   However, instead of casting to DOUBLE always, we'd like to do something smarter where we cast to the correct floating point type depending on input.
   
   Here's what some other systems do:
   
   Postgres
   > numeric for any integer type argument, double precision for a floating-point argument, otherwise the same as the argument data type
   
   MySQL
   > The SUM() and AVG() functions return a DECIMAL value for exact-value arguments (integer or DECIMAL), and a DOUBLE value for approximate-value arguments (FLOAT or DOUBLE).
   
   Creating this OSS issue to get consensus on the right way to do it in Pinot.
   
   cc: @walterddr  @siddharthteotia  @somandal 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] vvivekiyer commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.
vvivekiyer commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1451433594

   Created a PR for this fix. 
   cc: @walterddr @somandal @siddharthteotia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] vvivekiyer commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.
vvivekiyer commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1449070682

   **POSTGRES: Numeric DataType and AVG Return Type**
   Postgres supports some variable size datatypes (where the number of digits before and after decimal point is fixed and can be controlled by the user) like NUMERIC and DECIMAL. These are typically used for financial data.  The precision and scale values determine the number of digits before/after the decimal. 
   
   For AVG aggregation function, Postgres has the following return type:
   
   > avg ( smallint ) → numeric
   > avg ( integer ) → numeric
   > avg ( bigint ) → numeric
   > avg ( numeric ) → numeric
   > avg ( real ) → double precision
   > avg ( double precision ) → double precision
   > avg ( interval ) → interval
   
   
   For the cases where Postgres returns a numeric datatype (for all integer types above), the precision and scale of the result value could be different from the input values. It is calculated based on the input values as follows:
   > precision = max(precision(input_values) + ceil(log10(count(input_values))), scale(input_values) + decimal_places) 
   
   
   **Pinot: AVG Return Type**
   
   - I think the closest support we have to variable sized data-types is BIG_DECIMAL. Based on my reading, BIG_DECIMAL can support larger than DOUBLE values, I don't think it can handle exact PRECISION and SCALE during storage. Precision and Scale support is added through specific aggregation functions like SUMPRECISION, etc 
   
   -  Additionally, during the planning stage, we do not have access to the column values. So, we can't  calculate precision/scale values like Postgres does for NUMERIC datatype.
   
   Given these 2 limitations above, I'm wondering if it's possible to mimic Postgres behavior. Please let me know if I'm missing something @walterddr @somandal 
   
   
   At the very least, I think we need to change AVG return type to at least handle BIG_DECIMAL instead of returning a blanket double for everything.
   
   > INT, LONG, FLOAT, DOUBLE -> return DOUBLE
   > BIG_DECIMAL -> return BIG_DECIMAL 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] vvivekiyer commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.
vvivekiyer commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1448605060

   Let me do some digging today on Postgres numeric type and Calcite and get back. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1451137404

   +1 thank you so much for driving this @vvivekiyer this is very thorough research and discussion. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1447128731

   any updates on this one or anything we can do to get the ball rolling?
   
   from my end i voted for PostgreQL behavior, we can also check and see if the current Calcite extensions support generating behaviors like this (most likely going to modify the code that's changed in #10314)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] vvivekiyer commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.
vvivekiyer commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1451073873

   Based on offline conversation with Rong, we've agreed to have the return type as follows
   > INT, LONG, FLOAT, DOUBLE -> return DOUBLE
   > BIG_DECIMAL -> return BIG_DECIMAL
   
   I'll propose a PR to fix the return type for BIG_DECIMAL. 
   
   We considered two other alternative approaches:
   1. Use the formula similar to postgres above. But that's not feasible in Pinot as the planner doesn't have details about table size or # of records.  We also don't store this metadata in TableCache or Zk.
   2. As AVG is converted to SUM / Count() , we thought we could use Calcite's deriveDecimalType here to determine the type. But given that the default return type of COUNT() is always BIG_INT, this would not help as well. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] vvivekiyer closed issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "vvivekiyer (via GitHub)" <gi...@apache.org>.
vvivekiyer closed issue #10318: [multistage] Return type for AVG() aggregation function
URL: https://github.com/apache/pinot/issues/10318


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1450432071

   It is possible to mimic Postgres behavior. Calcite RelDataTypeFactory has precision and scale associated with each numeric type. we can apply the function as above with several caveats
   
   see https://github.com/apache/calcite/blob/2dba40e7a0a5651eac5a30d9e0a72f178bd9bff2/core/src/main/java/org/apache/calcite/rel/type/RelDataTypeFactoryImpl.java#L562-L587
   
   1. for all the precision and scale factors listed --> we round up to the nearest associated Pinot type.
       - according to calcite: INTEGER(10, 0) and BIGINT(38, 0), FLOAT(14, 7), DOUBLE(30, 15), 
       - for example: so if our scale calculation results in a (12, 2) --> this matches to FLOAT
   2. we don't need to exactly compute the ceil(log10(count()), we can use the table size to estimate given that the scale differences are so large between 2 pinot data types
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] walterddr commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "walterddr (via GitHub)" <gi...@apache.org>.
walterddr commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1440523550

   +1 for standardizing the type hoisting behavior. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] somandal commented on issue #10318: [multistage] Return type for AVG() aggregation function

Posted by "somandal (via GitHub)" <gi...@apache.org>.
somandal commented on issue #10318:
URL: https://github.com/apache/pinot/issues/10318#issuecomment-1447137611

   +1 for PostgreSQL behavior if possible. @vvivekiyer any luck in figuring out how "numeric" in PostgreSQL maps to Pinot data types?
   
   Also one more concern that came up was that the default behavior of AVG() in v1 engine will differ in v2 engine if we take a different approach. Will that be acceptable? This is a behavior change which will need to be called out for customers.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org