Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/03/10 23:48:48 UTC

[GitHub] [incubator-pinot] npawar opened a new issue #5135: Transform functions in Pinot schema

npawar opened a new issue #5135: Transform functions in Pinot schema
URL: https://github.com/apache/incubator-pinot/issues/5135
 
 
   Consider:
   X: Data at the source. This can be either a stream or data files. The formats are typically JSON, Avro, CSV, etc.
   Y: Data in Pinot. This is the record/document in Pinot.
   
   When data is ingested into Pinot (through either realtime or batch ingestion), every column in X must map directly to a column in Y. The only exception is the time column, where we allow a transformation from one time format to another, but this is limited to a single column. This means that every column in the destination schema must be present exactly as it is in the source schema (except the time column).
   This is not always practical. It is often desirable to apply some transformations to the source columns before they reach the destination.
   
   For example, consider this sample ads data schema:
   Source columns - **userID, name.firstName, name.lastName, IP, eventType, cost, timestamp**
   ```
     {
       "userID": 1,
       "name": { "firstName": "John", "lastName": "Doe"},
       "IP": "10.1.2.3",
       "eventType": "IMPRESSION",
       "cost": 2000,
       "timestamp": 1583882502198
     },
     {
       "userID": 2,
       "name": { "firstName": "Mary", "lastName": "Smith"},
       "IP": "10.5.6.7",
       "eventType": "IMPRESSION",
       "cost": 4000,
       "timestamp": 1583882502198
     },
     {
       "userID": 3,
       "name": { "firstName": "Rita", "lastName": "Skeeter"},
       "IP": "10.9.8.7",
       "eventType": "CLICK",
       "cost": 600,
       "timestamp": 1583882502198
     }
   ```
   
   Destination columns - **userId, fullName, country, zipcode, impressions, clicks, cost, hoursSinceEpoch, daysSinceEpoch**
   userId - Map userID to userId
   fullName - Concatenate name.firstName and name.lastName
   country - Extract country from IP
   zipcode - Extract zipcode from IP
   impressions - 1 if eventType=IMPRESSION, 0 otherwise
   clicks - 1 if eventType=CLICK, 0 otherwise
   cost - Maps directly from cost, no transformation
   hoursSinceEpoch - Convert timestamp to epoch hours
   daysSinceEpoch - Convert timestamp to epoch days
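   
   As an illustration only, the mapping above could be sketched in Python roughly as follows. This is not Pinot code: `transform`, `GeoLookup`, and `no_geo_lookup` are hypothetical names, joining the name parts with a single space is an assumption, `timestamp` is assumed to be epoch milliseconds, and the IP lookup is left as a placeholder because the mapping does not say how country and zipcode would be resolved.

```python
from typing import Callable, Optional, Tuple

MILLIS_PER_HOUR = 60 * 60 * 1000
MILLIS_PER_DAY = 24 * MILLIS_PER_HOUR

# Hypothetical signature for an IP geolocation lookup: IP -> (country, zipcode).
GeoLookup = Callable[[str], Tuple[Optional[str], Optional[str]]]


def no_geo_lookup(ip: str) -> Tuple[Optional[str], Optional[str]]:
    """Placeholder: a real lookup would consult a geolocation database."""
    return (None, None)


def transform(source: dict, geo_lookup: GeoLookup = no_geo_lookup) -> dict:
    """Map one source record (X) to one destination record (Y)."""
    country, zipcode = geo_lookup(source["IP"])
    timestamp_ms = source["timestamp"]  # assumed to be epoch milliseconds
    name = source["name"]  # nested field: name.firstName, name.lastName
    return {
        "userId": source["userID"],
        # Assumption: first and last name are joined with a single space.
        "fullName": name["firstName"] + " " + name["lastName"],
        "country": country,
        "zipcode": zipcode,
        "impressions": 1 if source["eventType"] == "IMPRESSION" else 0,
        "clicks": 1 if source["eventType"] == "CLICK" else 0,
        "cost": source["cost"],
        "hoursSinceEpoch": timestamp_ms // MILLIS_PER_HOUR,
        "daysSinceEpoch": timestamp_ms // MILLIS_PER_DAY,
    }
```

   Calling `transform` on the first sample record yields `fullName` "John Doe", `impressions` 1, `clicks` 0, `hoursSinceEpoch` 439967, and `daysSinceEpoch` 18331.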
   
   Today, the only way to achieve this in Pinot is for the user to write a custom transformation job that prepares the data to match the destination schema.
   
   Hence, the motivations for this proposal are as follows:
   1. Source and destination are not always 1:1 - Users have to write a transformation job, separately for realtime and for offline, which can lead to inconsistencies. It also adds an extra step to user onboarding.
   2. Be able to read nested source data fields.
   3. Be able to support multiple time columns - in order to use dataTimeSpec, we need support for derived functions.
   4. Be able to share transformation functions across use cases, instead of each user writing their own.
   5. Better schema evolution - When a new column is added, if it is derived from existing columns it can be backfilled with the correct values instead of default null values.
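   
   To make points 2 and 5 concrete, here is a minimal Python sketch under stated assumptions. `read_path` and `backfill` are hypothetical helpers invented for this illustration and do not correspond to any Pinot API; rows are modeled as plain dictionaries.

```python
from typing import Any, Callable, List


def read_path(record: dict, path: str) -> Any:
    """Resolve a dotted path such as 'name.firstName' in a nested record."""
    value: Any = record
    for key in path.split("."):
        value = value[key]
    return value


def backfill(rows: List[dict], column: str, derive: Callable[[dict], Any]) -> List[dict]:
    """Return copies of existing rows with a new derived column filled in.

    Each value is computed from the columns already present in the row,
    rather than being set to a default null.
    """
    return [{**row, column: derive(row)} for row in rows]
```

   For example, a new daysSinceEpoch column could be backfilled from an existing hoursSinceEpoch column with `backfill(rows, "daysSinceEpoch", lambda r: r["hoursSinceEpoch"] // 24)`.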
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] npawar commented on issue #5135: Transform functions in Pinot schema

Posted by GitBox <gi...@apache.org>.
npawar commented on issue #5135: Transform functions in Pinot schema
URL: https://github.com/apache/incubator-pinot/issues/5135#issuecomment-597374764
 
 
   Detailed design proposal here: https://docs.google.com/document/d/13BywJncHrLAFLm-qy4kfKaPxXfAg9XE5v3_fk9sGVSo/edit?usp=sharing



[GitHub] [incubator-pinot] npawar edited a comment on issue #5135: Transform functions in Pinot schema

Posted by GitBox <gi...@apache.org>.
npawar edited a comment on issue #5135:
URL: https://github.com/apache/incubator-pinot/issues/5135#issuecomment-625540358


   Next steps:
   1) Transformations using columns which themselves are a product of transformation - https://github.com/apache/incubator-pinot/issues/5351
   2) Support for custom functions (non-Groovy function evaluators) - https://github.com/apache/incubator-pinot/issues/5352
   3) Date time related custom functions - https://github.com/apache/incubator-pinot/issues/5313
   4) Flatten  - https://github.com/apache/incubator-pinot/issues/5264
   5) Filter - https://github.com/apache/incubator-pinot/issues/5268




[GitHub] [incubator-pinot] npawar commented on issue #5135: Transform functions in Pinot schema

Posted by GitBox <gi...@apache.org>.
npawar commented on issue #5135:
URL: https://github.com/apache/incubator-pinot/issues/5135#issuecomment-625541525


   Closing this, as there is now an issue for every follow-up.




[GitHub] [incubator-pinot] npawar commented on issue #5135: Transform functions in Pinot schema

Posted by GitBox <gi...@apache.org>.
npawar commented on issue #5135:
URL: https://github.com/apache/incubator-pinot/issues/5135#issuecomment-625540358


   Next steps:
   1) Transformations using columns which themselves are a product of transformation - https://github.com/apache/incubator-pinot/issues/5351
   2) Support for custom functions (non-Groovy function evaluators) - 
   3) Date time related custom functions - https://github.com/apache/incubator-pinot/issues/5313
   4) Flatten  - https://github.com/apache/incubator-pinot/issues/5264
   5) Filter - https://github.com/apache/incubator-pinot/issues/5268

