You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by "davecromberge (via GitHub)" <gi...@apache.org> on 2023/12/07 14:33:14 UTC

[I] Optimising StarTree metric configuration [pinot]

davecromberge opened a new issue, #12111:
URL: https://github.com/apache/pinot/issues/12111

   Problem
   ---------
   
   Some aggregation functions are closely related and can produce results from the same underlying metric.  In the StarTree index, the metric data is replicated for each function name pair.
   
   For example:
   
   ```
   {
     "dimensionsSplitOrder": [
        "team"
     ],
     "functionColumnPairs": [
         "DISTINCT_COUNT_CPC_SKETCH__players_cpc",
         "DISTINCT_COUNT_RAW_CPC_SKETCH__players_cpc"
      ],
   }
   ```
   
   If the "players" metric was an array of bytes, this would be replicated in the StarTree for each aggregation function above.  This increases storage and the resources used to construct and merge segments.
   
   Proposal
   ---------
   
   I would like to propose using the value aggregator name to remove redundant metric computation, and de-duplicate these to store the metric once.  This would impact the StarTree construction logic and the query evaluation logic.
   
   It would be nice to have feedback on this idea before I explore it further.
   
   /cc @Jackie-Jiang @snleee 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1846024519

   On closer inspection the value aggregators do appear separate but related, so deduplication is not straightforward.  I'm going to study the code further and look for other alternatives.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1854123973

   I have explored the idea of maintaining a mapping from `AggregationFunctionType` to `AggregationType` and have a [diff](https://github.com/apache/pinot/compare/master...davecromberge:attempt-2/shared-aggregations?expand=1) available of the initial attempt.  
   
   In summary, this does seem simpler than the "virtual" column approach but there are some comments about the implementation that I like to clarify / discuss.
   
   1. Re-using the `ValueAggregatorFactory` mapping was not accessible to some spi packages, and introducing it might introduce circular dependencies between packages.
   2. The segment index reader was not modified to exclude buffer views for duplicated indexes, but this was done in the StarTree loader instead where ForwardIndexReaders are not duplicated.  The reason for doing so is to improve memory performance where an existing index is loaded where the aggregation might already be duplicated.
   3. In the use case where adding a "duplicate" AggregationColumnPair to existing segments;  the StarTree might need to be re-created, even though the new column pair is covered by the existing aggregations.  
   4. The `StarTreeV2BuilderConfig` removes duplicates column pairs before the StarTree is constructed.  A consequence is that the resulting segment metadata would not match all the AggregationSpecs from the table - which might cause subsequent reloads / cycles of reloading.  This may require the de-duplication to be done in the StarTree builder itself.  
   
   One of the more difficult aspects of these changes is understanding the role of the SegmentMetadataV2 in the wider codebase and whether to either:
   - reflect all aggregation column pairs that are supported or covered by the startree index, or;
   - reflect only those aggregation column pairs that are actually in the startree index
   
   In the meantime, I'm happy to take the conversation into a PR if you are happy with the general direction that this is taking.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang closed issue #12111: Optimising StarTree metric configuration
URL: https://github.com/apache/pinot/issues/12111


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1851057424

   This is a great optimization!
   Ideally we should provide a mapping from query `AggregationFunctionType` to aggregator `AggregationFunctionType`, which is already happening implicitly under `ValueAggregatorFactory`. If we make this mapping explicit, we can always store the aggregator type in the metadata (for backward compatibility, we can do a map lookup when loading the metadata), and when solving queries, we lookup the mapping again to search for the function.
   IMO this way is easier to manage comparing to introducing the "virtual" function concept.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1850111531

   Instead of de-duplication, I have decided to extend the configuration so that the user can associate one or more "virtual" aggregation functions with each aggregation spec.  When encoded in metadata, this would look as follows:
   
   ```
   segment.index.version = v3
   startree.v2.count = 1
   startree.v2.0.total.docs = 2
   startree.v2.0.split.order = workspace_id
   startree.v2.0.split.order = cohort_code
   startree.v2.0.function.column.pairs = distinctCountThetaSketch__impressions
   startree.v2.0.aggregation.count = 1
   startree.v2.0.aggregation.0.function.type = distinctCountThetaSketch
   startree.v2.0.aggregation.0.column.name = impressions
   startree.v2.0.aggregation.0.compression.codec = LZ4
   startree.v2.0.aggregation.0.virtual.aggregation.count = 1
   startree.v2.0.aggregation.0.virtual.aggregation.0.function.type = distinctCountRawThetaSketch
   startree.v2.0.max.leaf.records = 10000
   ```
   
   When absent or missing, this can be encoded as:
   ```
   segment.index.version = v3
   startree.v2.count = 1
   startree.v2.0.total.docs = 2
   startree.v2.0.split.order = workspace_id
   startree.v2.0.split.order = cohort_code
   startree.v2.0.function.column.pairs = distinctCountThetaSketch__impressions
   startree.v2.0.aggregation.count = 1
   startree.v2.0.aggregation.0.function.type = distinctCountThetaSketch
   startree.v2.0.aggregation.0.column.name = impressions
   startree.v2.0.aggregation.0.compression.codec = LZ4
   startree.v2.0.aggregation.0.virtual.aggregation.count = 0
   startree.v2.0.max.leaf.records = 10000
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "Jackie-Jiang (via GitHub)" <gi...@apache.org>.
Jackie-Jiang commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1858427204

   Thanks for trying this approach out! Let's move the conversation to a PR and we can discuss the pros and cons there


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


Re: [I] Optimising StarTree metric configuration [pinot]

Posted by "davecromberge (via GitHub)" <gi...@apache.org>.
davecromberge commented on issue #12111:
URL: https://github.com/apache/pinot/issues/12111#issuecomment-1851908193

   @Jackie-Jiang thank you for your input!
   
   When I started considering solutions, I had similar inclinations to capture the relationships between value aggregators but I was going to do it internally.  Giving the end-user control over which function to use is better.
   
   In the meantime, this [diff](https://github.com/apache/pinot/compare/master...davecromberge:attempt-1/virtual-functions?expand=1) shows the fundamental changes that I had in mind to implement the virtual aggregation functions - I'd be interested in what you think.  The advantage of this approach is that the internals for constructing a StarTree remains unchanged.  When loading the StarTree, the forward index is re-used for all virtual aggregation functions in order to correctly retrieve a Datasource.
   Note: this draft implementation might not address all corner-cases in the query evaluation part and furthermore it only supports the StarTreeV2.
   
    I would like to explore your idea further in a new branch and will share further results.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org