
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/10/14 23:06:09 UTC

[GitHub] [incubator-pinot] npawar commented on issue #5509: Derived columns

npawar commented on issue #5509:
URL: https://github.com/apache/incubator-pinot/issues/5509#issuecomment-708706434

**Challenges**
Although this looks exactly like transform functions, there are some differences that prevent us from handling it solely with regular transform functions.

Say we have columns `a, b, c` in the raw data source.
Say we have columns `a, c, x, y` in the Pinot schema, such that
```
a -> a
c -> c
x -> f(b)
y -> f(a, c)
```
1) The way transform functions are designed right now, arguments to transform functions can only come from the source columns `a, b, c`. If we wanted to add `z -> f(x, y)`, this would not be supported.
2) Transform functions are only evaluated during ingestion. For derived columns, we want to support adding them even after segment creation, i.e. add some derived columns to an existing schema and see the new values in the segments after a reload.

**Changes**
1 is easy to fix, and can be done by simply enhancing the support for transform configs. Some changes needed:
- Remove the validations that prevent adding transform functions of the derived kind (i.e. `y = f(z)` and `x = f(y)` is blocked by the table config validations right now; this can easily be removed).
- In the ExpressionTransformer, identify the derived columns and evaluate them after the non-derived fields.
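
To make the evaluation order concrete, here is a minimal sketch of the two-pass idea. It is not the actual `ExpressionTransformer` code: `TwoPassTransformSketch`, `Transform` and `apply` are hypothetical names made up for illustration, and rows are modelled as plain maps. It also assumes derived transforms only depend on source columns or on first-pass outputs; chains among derived columns would need proper dependency ordering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Pinot APIs. */
final class TwoPassTransformSketch {

  /** A transform for one destination column: its argument columns and the function. */
  static final class Transform {
    final List<String> arguments;
    final Function<Map<String, Object>, Object> function;

    Transform(List<String> arguments, Function<Map<String, Object>, Object> function) {
      this.arguments = arguments;
      this.function = function;
    }
  }

  /**
   * Pass 1 evaluates transforms whose arguments are all source columns.
   * Pass 2 evaluates the remaining (derived) transforms, which may read
   * the values produced in pass 1.
   */
  static Map<String, Object> apply(Map<String, Object> sourceRow, Set<String> sourceColumns,
      Map<String, Transform> transforms) {
    Map<String, Object> row = new HashMap<>(sourceRow);
    List<Map.Entry<String, Transform>> derived = new ArrayList<>();

    for (Map.Entry<String, Transform> entry : transforms.entrySet()) {
      Transform transform = entry.getValue();
      if (sourceColumns.containsAll(transform.arguments)) {
        // Non-derived: read only from the untouched source row.
        row.put(entry.getKey(), transform.function.apply(sourceRow));
      } else {
        // Derived: defer until the non-derived values are available.
        derived.add(entry);
      }
    }
    for (Map.Entry<String, Transform> entry : derived) {
      row.put(entry.getKey(), entry.getValue().function.apply(row));
    }
    return row;
  }
}
```

With source columns `a, b, c` and transforms for `x`, `y` and `z -> f(x, y)`, `z` is deferred to the second pass regardless of the order in which the transforms are configured.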

Handling 2 will need some more changes. We need to start computing the derived fields' transform functions during segment reload. For this, we could piggyback on the `DefaultColumnHandler`. Similar to `BaseDefaultColumnHandler#updateDefaultColumn`, we can introduce a `BaseDefaultColumnHandler#updateDerivedColumns`, which can be called instead of `updateDefaultColumn` if the column is a derived field.

**Identifying derived fields**
We also need a flag on the FieldSpec called `derived`. This flag will help us distinguish between derived fields and regular fields. Here's an example of why we need it:
You may have `y = f(z) and x = f(y)`. Here `x` is obviously derived, because it takes `y` as an argument and `y` is not in the source data.
You may also have `y = y and x = f(y)`. Here it is not obvious whether `x` is a derived field, because `y` is both source and destination.
In both examples, `x` will be evaluated during segment creation. The deciding factor for whether a field is derived is whether the user wants Pinot to generate the values during a segment reload if the column was not already present.
If the user marks `x` as a derived column and reloads segments, all segments missing `x` should evaluate `x = f(y)` using the value of `y` already in the segment.
If the user does not mark `x` as a derived column, then during reload all segments missing `x` should simply add the default value for `x`.
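
The reload rule above can be sketched roughly as follows. This is only an illustration of the decision, not the proposed `BaseDefaultColumnHandler` change: `ReloadBackfillSketch`, `ColumnSpec` and `backfill` are hypothetical names, and a segment is simplified to one row represented as a map, whereas real segments are columnar.

```java
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Pinot APIs. */
final class ReloadBackfillSketch {

  /** Stand-in for a FieldSpec that carries the proposed `derived` flag. */
  static final class ColumnSpec {
    final String name;
    final boolean derived;
    final Object defaultValue;
    final Function<Map<String, Object>, Object> transform; // null if the column has no transform

    ColumnSpec(String name, boolean derived, Object defaultValue,
        Function<Map<String, Object>, Object> transform) {
      this.name = name;
      this.derived = derived;
      this.defaultValue = defaultValue;
      this.transform = transform;
    }
  }

  /** Value to fill in on reload for a column that is missing from an existing segment row. */
  static Object backfill(ColumnSpec spec, Map<String, Object> existingRow) {
    if (spec.derived && spec.transform != null) {
      // Marked as derived: compute from values already stored in the segment.
      return spec.transform.apply(existingRow);
    }
    // Not marked as derived: fall back to the default value.
    return spec.defaultValue;
  }
}
```

The same transform `x = f(y)` therefore yields either computed values or defaults on reload, depending only on the flag.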
