Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2020/03/05 18:47:26 UTC

[GitHub] [incubator-pinot] apucher opened a new issue #5117: Synthetic Data Generator

apucher opened a new issue #5117: Synthetic Data Generator
URL: https://github.com/apache/incubator-pinot/issues/5117
 
 
   **Design Doc**: https://cwiki.apache.org/confluence/display/INCUBATOR/Synthetic+Data+Generator+for+Pinot
   
   As Pinot moves forward and becomes easier for people to set up and explore, we're hitting limits in terms of (a) what data sets we can include, (b) how much data we can package with the distribution, and (c) how well these data sets showcase Apache Pinot and its ecosystem. This is true for both the source distributions and the pre-made Docker images. Many public data sets are available for personal or academic use only and therefore, strictly speaking, prevent Apache Pinot from packaging or otherwise including them. Additionally, we can only package so much data before bloating the repository and images. Finally, pre-existing data sets may not be able to showcase or stress a specific part of Pinot for testing or demonstration purposes.
   
   One way to work around these limitations is to generate synthetic "mock" data that looks and feels like real data sets without actually including the original data. Instead of shipping pre-made data sets, we can generate time series from templates and features that we designed or extracted previously. This works around both the licensing and the capacity issues, and allows us to generate well-suited testing and demo data on demand.
   
   **Proposed Approach**
   We want to add support for complex data generator "templates" to pinot-admin. The existing tool already has rudimentary abilities to generate data for benchmarking or testing, but this data is strictly random noise and usually unsuited for dimensional breakdowns. We propose to add generator templates that produce time series that appear familiar to developers, analysts, and other business stakeholders, and that intuitively "make sense". For example, these templates could produce diurnal (day-night) page view and click time series for an imaginary website, or long-tail (spiky) error metrics that decompose sensibly into multiple dimensions. This approach is trivially extensible, and new templates can be added as needed.
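
To make the two kinds of series concrete, here is a minimal Python sketch of a diurnal page-view series, a long-tail error series, and a fixed-proportion dimensional breakdown. This is an illustration only and not the proposed pinot-admin implementation: all function names, parameters, and default values are assumptions chosen for the example.

```python
import math
import random


def seasonal_series(hours, mean=1000.0, amplitude=400.0, noise=50.0, seed=0):
    """Hourly page views with a 24-hour (diurnal) cycle plus Gaussian noise."""
    rng = random.Random(seed)
    values = []
    for h in range(hours):
        # Cosine term peaks at hour 14 and bottoms out at hour 2 of each day.
        cycle = math.cos(2 * math.pi * ((h % 24) - 14) / 24)
        values.append(max(0.0, mean + amplitude * cycle + rng.gauss(0, noise)))
    return values


def rare_event_series(hours, rate=0.02, scale=20.0, seed=0):
    """Mostly-zero error counts with occasional heavy-tailed spikes."""
    rng = random.Random(seed)
    values = []
    for _ in range(hours):
        if rng.random() < rate:
            # Pareto draws give a long tail: most spikes small, a few very large.
            values.append(int(scale * rng.paretovariate(1.5)))
        else:
            values.append(0)
    return values


def split_by_dimension(series, weights):
    """Break a metric down across dimension values in fixed proportions."""
    total = sum(weights.values())
    return {key: [v * w / total for v in series] for key, w in weights.items()}


views = seasonal_series(48)
errors = rare_event_series(1000)
views_by_country = split_by_dimension(views, {"US": 5, "DE": 3, "IN": 2})
```

Because the breakdown uses fixed weights, the per-dimension series always sum back to the total, which is the property that pure random noise lacks.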
   
   We would reuse pinot-admin's "GenerateData" command and extend the existing schema annotations with a "template" property that enables both Pinot contributors and Pinot users to configure arbitrary generator templates in the familiar JSON format. We provide several examples in the design doc.
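
As a rough illustration of how a "template" property on a schema annotation could drive generation, the sketch below parses a JSON annotation and looks up a generator in a registry. The JSON layout, the field names (`type`, `mean`, `amplitude`, `period`), and the template names are all hypothetical; the actual format is defined in the design doc, not here.

```python
import json
import math

# Hypothetical annotation: the keys inside "template" are assumptions
# made for this example, not the format from the design doc.
ANNOTATION = json.loads("""
{
  "column": "views",
  "template": {"type": "SEASONAL", "mean": 1000, "amplitude": 400, "period": 24}
}
""")


def seasonal(step, mean, amplitude, period):
    """Sine wave around a mean value, repeating every `period` steps."""
    return mean + amplitude * math.sin(2 * math.pi * step / period)


def constant(step, value):
    """Same value at every step."""
    return value


# Registry of template types; supporting a new template means adding one entry.
GENERATORS = {"SEASONAL": seasonal, "CONSTANT": constant}


def build_generator(annotation):
    """Return a function step -> value for the annotated column."""
    params = dict(annotation["template"])
    kind = params.pop("type")
    if kind not in GENERATORS:
        raise ValueError(f"unknown template type: {kind}")
    fn = GENERATORS[kind]
    return lambda step: fn(step, **params)


generate = build_generator(ANNOTATION)
first_day = [generate(step) for step in range(24)]
```

Keeping the template parameters in JSON and the generator logic in a registry is one way to let users configure templates without code changes while contributors add new template types.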
   
   **Time Series Examples**
   A selection is shown below. See the design doc for more examples.
   
   Seasonal time series
   ![image](https://user-images.githubusercontent.com/25439965/76014202-0fd34780-5ece-11ea-92a0-cab08e03bf76.png)
   
   Rare events time series
   ![image](https://user-images.githubusercontent.com/25439965/76014281-2d081600-5ece-11ea-9e0b-a06ac0241295.png)
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] apucher closed issue #5117: Synthetic Data Generator

Posted by GitBox <gi...@apache.org>.
apucher closed issue #5117: Synthetic Data Generator
URL: https://github.com/apache/incubator-pinot/issues/5117
 
 
   
