You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "xiangfu0 (via GitHub)" <gi...@apache.org> on 2023/07/21 19:48:10 UTC
[GitHub] [pinot] xiangfu0 commented on pull request #11117: [multistage] bridge v2 query engine for leaf stage v1 multi-value column

xiangfu0 commented on PR #11117:
URL: https://github.com/apache/pinot/pull/11117#issuecomment-1646168883

   > the grand question is whether we continue to support MV as a data type or collapse into ARRAY. if the answer to the above question is NO, this PR is not needed. if the answer is YES. there are several concerns. i thought about it a bit last night. the problem is 2 fold.
   > 
   > 1. how do we support it conforming with ANSI syntax
   > 2. how do we make sure v1 perform the same as before.
   > 
   > for Q1: the answer is probably MULTI_SET
   > 
   > * it is a SET with no ordering but can contain duplicates, suitable for our use case for GROUP-BY and FILTER, which they are considered as expanded/unnested during eval.
   > * but not every use cases is considered as MULTI_SET, for example selection will consider MV as an ARRAY
   >   therefore,
   > 
   > 1. an explicit `USE_AS_MULTISET(mv)` is desired to bridge the syntatic gap
   > 2. explicit `USE_AS_ARRAY(mv)` is the default (considering select * there's no reason to ask user explicitly put this in
   > 
   > for Q2: the problem is these USE_AS_*** methods are considered transform, which shouldn't be the case there's a simply way to solve this --> in CalciteRexExpressionUtils, we can simply ignored the functionCall and directly return the operand as a reference in V1 b/c V1 is SqlNode and bares no type info, this means directly putting the operand input reference without the USE_AS_*** type conversion will work naturally with the V1 context.
   > 
   > so in short my proposal is
   > 
   > 1. have `USE_AS_ARRAY` and `USE_AS_MULTISET` as scalarFunction wrappers (unimplemented) to bridge the gap of the syntactic problem on calcite
   > 2. in PhysicalPlanner explicitly exclude these syntatitic functions and directly drop to the operand;
   > 3. in the mid term we will have MV, MULTI_SET, ARRAY as 3 "co-existing" types, but in V1 there's only MV; in V2 MV is considered by default as ARRAY, unless `USE_AS_MULTISET` is used.
   > 4. in the long term, once we fully found all alternatives in standard SQL syntax for MV, we can safely consider MV as ARRAY. and only do MULTISET when necessary (i can only see it being useful in `MV > 5` situation, where there's no array equivalent, only multiset equivalent)
   > 
   > WDYT? Please let me know
   
   MV column in V2 is modeled as ARRAY by default.
   So in terms of supporting v1 format, we need to:
   1. use `ARRAY_TO_MV` or `USE_AS_MV` as the bridge to ensure Calcite understand the type consistency/conversion.
   2. (TODO) Implement MV GROUP BY in v2 intermediate stage to complete the story.
   3. (TODO) Implement ARRAY GROUP BY later on


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org