Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2019/10/28 22:25:43 UTC

[GitHub] [incubator-pinot] siddharthteotia opened a new pull request #4747: Data Anonymizer Tool

URL: https://github.com/apache/incubator-pinot/pull/4747
 
 
   It is not always possible to use **production data** for:
   
   - Building regression test frameworks that do not depend on externally available datasets or on production data.
   - Performance benchmarking and functional evaluation against other systems such as Druid, Azure Data Explorer (Kusto), and Impala.
   - Estimation: how will we do on 10 TB of data? The tool can generate any volume of data for an arbitrary schema.
   
   **Why not use public datasets?**
   
   - Some of the public datasets we explored are modest in volume (one million to a few million rows) compared to what Pinot currently runs on in production.
   - We want to benchmark on a dataset that closely resembles production data, so that decisions based on the results are better informed.
   - TPC-H would have been useful, but most of its queries are join-centric.
   
   The tool first analyzes the characteristics of the production data that Pinot runs on, then uses those characteristics to generate random data that cannot be reversed to the original values (one Avro file per segment).
   
   The generated data preserves characteristics such as cardinality, distribution of values, and value length.
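
   As a rough illustration of what preserving these characteristics means, the sketch below maps each distinct string value to a random string of the same length. This is not the tool's code: `anonymize_column` and its parameters are hypothetical names, and the sketch assumes values are long enough that a unique replacement can always be drawn.

```python
import random
import string
from collections import Counter


def anonymize_column(values, seed=0):
    """Hypothetical helper: replace each distinct value with a random
    lowercase string of the same length, using a one-to-one mapping."""
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue
        # Retry until the replacement is unused, so the mapping stays
        # one-to-one and the column's cardinality does not shrink.
        # Assumption: the alphabet is large enough for this to terminate.
        while True:
            candidate = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(len(value))
            )
            if candidate not in used:
                break
        used.add(candidate)
        mapping[value] = candidate
    return [mapping[value] for value in values]


column = ["alice", "bob", "alice", "carol", "bob", "alice"]
anonymized = anonymize_column(column)

# Same number of distinct values, same frequency profile, same lengths.
same_cardinality = len(set(anonymized)) == len(set(column))
same_distribution = sorted(Counter(anonymized).values()) == sorted(Counter(column).values())
same_lengths = [len(v) for v in anonymized] == [len(v) for v in column]
```

   Because every occurrence of a value is replaced by the same random string, the frequency of each value carries over unchanged.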
   
   The data generation approach also preserves query patterns. For example:
   
   SELECT * FROM Pinot_Prod_Table WHERE COL < 20000
   
   The tool builds a sorted global dictionary, which ensures that when it maps 20000 to a randomly generated value V, all original column values < 20000 are also mapped to randomly generated values < V. With this approach, if the original query returned 100k rows on the actual data, the generated query should return roughly the same number of rows on the generated data.
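
   The order-preserving idea can be sketched as follows. This is a minimal illustration, not the implementation in this PR: `build_sorted_dictionary` is a hypothetical name, the column is a small integer list, and the sketch assumes the predicate literal (20000) is itself present in the column so that it has a dictionary entry.

```python
import random


def build_sorted_dictionary(values, seed=0):
    """Hypothetical helper: map the i-th smallest original value to the
    i-th smallest of an equal number of distinct random integers."""
    rng = random.Random(seed)
    originals = sorted(set(values))
    # Sample distinct random integers, then sort them so that the
    # mapping is strictly increasing in the original value.
    generated = sorted(
        rng.sample(range(10 * len(originals) + 1000), len(originals))
    )
    return dict(zip(originals, generated))


column = [5, 19999, 20000, 20000, 31000, 120, 45000, 19999]
dictionary = build_sorted_dictionary(column)
anonymized = [dictionary[value] for value in column]

# Rewrite the predicate "COL < 20000" as "COL < V" on the generated data.
v = dictionary[20000]
original_count = sum(1 for value in column if value < 20000)
generated_count = sum(1 for value in anonymized if value < v)
```

   In this toy case the two counts match exactly, because the mapping is strictly increasing and the literal is a dictionary key.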
   
   The code is well documented; the comments explain the purpose, usage, and implementation notes in detail.
   
   (1) Filter column extractor from query file
   (2) Global dictionary builder
   (3) Data generator
   (4) Query generator
