You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2021/02/04 23:29:01 UTC

[GitHub] [incubator-pinot] sajjad-moradi opened a new pull request #6546: Improve Real Time Provisioning Helper tool

sajjad-moradi opened a new pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546


   ## Description
   Currently Real Time Provisioning Helper tool takes a completed segment as an input. With the changes in this PR, a user can provide data characteristics instead of an actual segment. With this option, the tool does a preprocessing step and generates a segment based on the provided characteristics. After the segment is generated, it just uses that segment to provide insight on the memory footprint as usual.
   
   That main changes in the code:
   - refactored a few existing `Generator`s in `data/generator` package and also added a couple of new ones
   - added `Segment Generator` to `Memory Estimator`
   - modified `RealtimeProvisioingHelperCommand`
   ## Testing Done
   - Unit tests
   - Ran `pinot-admin RealtimeProvisioningHelper` locally with the same files provided in the unit tests (1M rows):
   ```bash
   2021/02/04 15:13:39.243 INFO [RealtimeProvisioningHelperCommand] [main] Executing command: RealtimeProvisioningHelper -tableConfigFile table-config.json -numPartitions 10 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -schemaFile schema.json -dataCharacteristicsFile data-characteristics.json -ingestionRate 150 -maxUsableHostMemory 48G -retentionHours 48
   2021/02/04 15:13:41.549 INFO [MemoryEstimator$SegmentGenerator] [main] Successfully generated data file: /var/folders/sd/fgc60hhj2994pk9vm1xw235h000xqy/T/2021-02-04_15:13:39-csv/output_0.csv
   2021/02/04 15:13:41.549 INFO [MemoryEstimator$SegmentGenerator] [main] Started creating segment from file: /var/folders/sd/fgc60hhj2994pk9vm1xw235h000xqy/T/2021-02-04_15:13:39-csv/output_0.csv
   2021/02/04 15:13:49.084 INFO [MemoryEstimator$SegmentGenerator] [main] Successfully created segment: testTable_18667_18766_0 at directory: /var/folders/sd/fgc60hhj2994pk9vm1xw235h000xqy/T/2021-02-04_15:13:39-segment/testTable_18667_18766_0
   2021/02/04 15:13:49.085 INFO [MemoryEstimator$SegmentGenerator] [main] Verifying the segment by loading it
   2021/02/04 15:13:49.161 INFO [MemoryEstimator$SegmentGenerator] [main] Successfully loaded segment: testTable_18667_18766_0 of size: 18766286 bytes
   
   ============================================================
   RealtimeProvisioningHelper -tableConfigFile table-config.json -numPartitions 10 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -schemaFile schema.json -dataCharacteristicsFile data-characteristics.json -ingestionRate 150 -maxUsableHostMemory 48G -retentionHours 48
   
   Note:
   
   * Table retention and push frequency ignored for determining retentionHours since it is specified in command
   * See https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime
   2021/02/04 15:13:53.141 INFO [RealtimeProvisioningHelperCommand] [main] 
   Memory used per host (Active/Mapped)
   
   numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
   numHours
    2 --------> 7.67G/17.86G    |4.09G/9.53G     |2.56G/5.95G     |2.05G/4.76G     |1.53G/3.57G     |1.53G/3.57G     |1.53G/3.57G     |1.02G/2.38G     |
    3 --------> 8.03G/18.22G    |4.28G/9.72G     |2.68G/6.07G     |2.14G/4.86G     |1.61G/3.64G     |1.61G/3.64G     |1.61G/3.64G     |1.07G/2.43G     |
    4 --------> 8.38G/18.58G    |4.47G/9.91G     |2.79G/6.19G     |2.24G/4.95G     |1.68G/3.72G     |1.68G/3.72G     |1.68G/3.72G     |1.12G/2.48G     |
    5 --------> 9.02G/18.93G    |4.81G/10.1G     |3.01G/6.31G     |2.41G/5.05G     |1.8G/3.79G      |1.8G/3.79G      |1.8G/3.79G      |1.2G/2.52G      |
    6 --------> 9.1G/19.29G     |4.85G/10.29G    |3.03G/6.43G     |2.43G/5.14G     |1.82G/3.86G     |1.82G/3.86G     |1.82G/3.86G     |1.21G/2.57G     |
    7 --------> 9.59G/20.5G     |5.12G/10.93G    |3.2G/6.83G      |2.56G/5.47G     |1.92G/4.1G      |1.92G/4.1G      |1.92G/4.1G      |1.28G/2.73G     |
    8 --------> 9.81G/20G       |5.23G/10.67G    |3.27G/6.67G     |2.62G/5.33G     |1.96G/4G        |1.96G/4G        |1.96G/4G        |1.31G/2.67G     |
    9 --------> 11.01G/21.21G   |5.87G/11.31G    |3.67G/7.07G     |2.94G/5.66G     |2.2G/4.24G      |2.2G/4.24G      |2.2G/4.24G      |1.47G/2.83G     |
   10 --------> 10.8G/20.71G    |5.76G/11.05G    |3.6G/6.9G       |2.88G/5.52G     |2.16G/4.14G     |2.16G/4.14G     |2.16G/4.14G     |1.44G/2.76G     |
   11 --------> 11.87G/21.21G   |6.33G/11.31G    |3.96G/7.07G     |3.16G/5.66G     |2.37G/4.24G     |2.37G/4.24G     |2.37G/4.24G     |1.58G/2.83G     |
   12 --------> 11.23G/21.43G   |5.99G/11.43G    |3.74G/7.14G     |3G/5.71G        |2.25G/4.29G     |2.25G/4.29G     |2.25G/4.29G     |1.5G/2.86G      |
   2021/02/04 15:13:53.142 INFO [RealtimeProvisioningHelperCommand] [main] 
   Optimal segment size
   
   numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
   numHours
    2 --------> 19.33M          |19.33M          |19.33M          |19.33M          |19.33M          |19.33M          |19.33M          |19.33M          |
    3 --------> 29M             |29M             |29M             |29M             |29M             |29M             |29M             |29M             |
    4 --------> 38.66M          |38.66M          |38.66M          |38.66M          |38.66M          |38.66M          |38.66M          |38.66M          |
    5 --------> 48.33M          |48.33M          |48.33M          |48.33M          |48.33M          |48.33M          |48.33M          |48.33M          |
    6 --------> 57.99M          |57.99M          |57.99M          |57.99M          |57.99M          |57.99M          |57.99M          |57.99M          |
    7 --------> 67.66M          |67.66M          |67.66M          |67.66M          |67.66M          |67.66M          |67.66M          |67.66M          |
    8 --------> 77.32M          |77.32M          |77.32M          |77.32M          |77.32M          |77.32M          |77.32M          |77.32M          |
    9 --------> 86.99M          |86.99M          |86.99M          |86.99M          |86.99M          |86.99M          |86.99M          |86.99M          |
   10 --------> 96.65M          |96.65M          |96.65M          |96.65M          |96.65M          |96.65M          |96.65M          |96.65M          |
   11 --------> 106.32M         |106.32M         |106.32M         |106.32M         |106.32M         |106.32M         |106.32M         |106.32M         |
   12 --------> 115.98M         |115.98M         |115.98M         |115.98M         |115.98M         |115.98M         |115.98M         |115.98M         |
   2021/02/04 15:13:53.144 INFO [RealtimeProvisioningHelperCommand] [main] 
   Consuming memory
   
   numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
   numHours
    2 --------> 1.16G           |632.16M         |395.1M          |316.08M         |237.06M         |237.06M         |237.06M         |158.04M         |
    3 --------> 1.66G           |904.1M          |565.06M         |452.05M         |339.04M         |339.04M         |339.04M         |226.02M         |
    4 --------> 2.15G           |1.15G           |735.02M         |588.02M         |441.01M         |441.01M         |441.01M         |294.01M         |
    5 --------> 2.65G           |1.41G           |904.98M         |723.99M         |542.99M         |542.99M         |542.99M         |361.99M         |
    6 --------> 3.15G           |1.68G           |1.05G           |859.96M         |644.97M         |644.97M         |644.97M         |429.98M         |
    7 --------> 3.65G           |1.95G           |1.22G           |995.93M         |746.94M         |746.94M         |746.94M         |497.96M         |
    8 --------> 4.15G           |2.21G           |1.38G           |1.11G           |848.92M         |848.92M         |848.92M         |565.95M         |
    9 --------> 4.64G           |2.48G           |1.55G           |1.24G           |950.9M          |950.9M          |950.9M          |633.93M         |
   10 --------> 5.14G           |2.74G           |1.71G           |1.37G           |1.03G           |1.03G           |1.03G           |701.92M         |
   11 --------> 5.64G           |3.01G           |1.88G           |1.5G            |1.13G           |1.13G           |1.13G           |769.9M          |
   12 --------> 6.14G           |3.27G           |2.05G           |1.64G           |1.23G           |1.23G           |1.23G           |837.89M         |
   2021/02/04 15:13:53.145 INFO [RealtimeProvisioningHelperCommand] [main] 
   Total number of segments queried per host (for all partitions)
   
   numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
   numHours
    2 --------> 360             |192             |120             |96              |72              |72              |72              |48              |
    3 --------> 240             |128             |80              |64              |48              |48              |48              |32              |
    4 --------> 180             |96              |60              |48              |36              |36              |36              |24              |
    5 --------> 150             |80              |50              |40              |30              |30              |30              |20              |
    6 --------> 120             |64              |40              |32              |24              |24              |24              |16              |
    7 --------> 105             |56              |35              |28              |21              |21              |21              |14              |
    8 --------> 90              |48              |30              |24              |18              |18              |18              |12              |
    9 --------> 90              |48              |30              |24              |18              |18              |18              |12              |
   10 --------> 75              |40              |25              |20              |15              |15              |15              |10              |
   11 --------> 75              |40              |25              |20              |15              |15              |15              |10              |
   12 --------> 60              |32              |20              |16              |12              |12              |12              |8               |
   
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572443033



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/DataGeneratorSpec.java
##########
@@ -38,6 +38,8 @@
   private final Map<String, Integer> cardinalityMap;
   private final Map<String, IntRange> rangeMap;
   private final Map<String, Map<String, Object>> patternMap;
+  private final Map<String, Double> mvCountMap; // map of column name to average number of values per entry
+  private final Map<String, Integer> lengthMap; // map of column name to average length of th entry (used for string generator)

Review comment:
       Both for string and bytes right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578062635



##########
File path: pinot-common/src/main/java/org/apache/pinot/common/data/schema/DateTimeFieldSpecMetadata.java
##########
@@ -0,0 +1,47 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.common.data.schema;
+
+
+public class DateTimeFieldSpecMetadata extends FieldMetadata {

Review comment:
       Discussed offline with @sajjad-moradi  on why this is needed. Please add some javadoc explaining the context on why this is needed. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578068608



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       That's correct, but even with completed segments, `MemoryEstimator` tries to extrapolate the actual memory size based on secondsToConsume (derived from numHours input) parameter:
   ```java
   int totalDocs = (int) (((double) secondsToConsume / _sampleSegmentConsumedSeconds) * _totalDocsInSampleSegment);
   long memoryForConsumingSegmentPerPartition = getMemoryForConsumingSegmentPerPartition(statsFile, totalDocs);
   ```
   My point we have only one completed segment (given or generated) and that should be enough to for estimating memory for different consumption window sizes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572512496



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/admin/command/RealtimeProvisioningHelperCommand.java
##########
@@ -71,9 +72,15 @@
   @Option(name = "-numHours", metaVar = "<String>", usage = "number of hours to consume as comma separated values (default 2,3,4,5,6,7,8,9,10,11,12)")
   private String _numHours = "2,3,4,5,6,7,8,9,10,11,12";
 
-  @Option(name = "-sampleCompletedSegmentDir", required = true, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
+  @Option(name = "-sampleCompletedSegmentDir", required = false, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
   private String _sampleCompletedSegmentDir;
 
+  @Option(name = "-dataCharacteristicsFile", required = false, metaVar = "<String>", usage = "File containing characteristics by which a segment will be generated")
+  private String _dataCharacteristicsFile;
+
+  @Option(name = "-schemaFile", required = false, metaVar = "<String>")

Review comment:
       @mcvsubbu , sorry I don't think I am following this comment. We are providing data characteristics right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577251040



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -107,6 +129,15 @@ public MemoryEstimator(TableConfig tableConfig, File sampleCompletedSegment, int
     _tableDataDir.mkdir();
   }
 
+  public MemoryEstimator(TableConfig tableConfig, File dataCharacteristicsFile, File schemaFile, int ingestionRate,
+      long maxUsableHostMemory, int tableRetentionHours) {
+    this(tableConfig,
+        generateCompletedSegment(dataCharacteristicsFile, schemaFile, tableConfig),

Review comment:
       Why do we want to do that? We generate it in `java.io.tmpdir` so it gets deleted automatically.

##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/NumberGenerator.java
##########
@@ -61,23 +66,23 @@ public void init() {
         final int start = rand.nextInt(cardinality);
         final int end = start + cardinality;
         for (int i = start; i < end; i++) {
-          intValues.add(new Integer(i));
+          intValues.add(i);
         }
         break;
       case LONG:
         longValues = new ArrayList<Long>();
         final long longStart = rand.nextInt(cardinality);
         final long longEnd = longStart + cardinality;
         for (long i = longStart; i < longEnd; i++) {
-          longValues.add(new Long(i));
+          longValues.add(i);
         }
         break;
       case FLOAT:
         floatValues = new ArrayList<Float>();
-        final float floatStart = rand.nextFloat() * rand.nextInt(1000);
+        final float floatStart = Math.round(rand.nextFloat()* 100.0f) / 100.0f;

Review comment:
       This makes the generated float numbers easier on the eye - with two decimal places.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577268997



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       I'm not sure why we need to generate millions of records. I assume something below 500K records should suffice and my assumption is that we won't get into trouble with that number of rows. 
   BTW I have tested it on my mac with 10M rows with high cardinality on different columns in order of millions and no issue observed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577250380



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/DataGenerator.java
##########
@@ -163,26 +174,75 @@ private FieldSpec buildSpec(DataGeneratorSpec genSpec, String column) {
 
   public static void main(String[] args)
       throws IOException {
-    final String[] columns = {"column1", "column2", "column3", "column4", "column5"};
+
     final Map<String, DataType> dataTypes = new HashMap<>();
     final Map<String, FieldType> fieldTypes = new HashMap<>();
     final Map<String, TimeUnit> timeUnits = new HashMap<>();
 
     final Map<String, Integer> cardinality = new HashMap<>();
     final Map<String, IntRange> range = new HashMap<>();
     final Map<String, Map<String, Object>> template = new HashMap<>();
-
-    for (final String col : columns) {
-      dataTypes.put(col, DataType.INT);
-      fieldTypes.put(col, FieldType.DIMENSION);
-      cardinality.put(col, 1000);
+    Map<String, Double> mvCountMap = new HashMap<>();
+    Map<String, Integer> lengthMap = new HashMap<>();
+    List<String> columnNames = new ArrayList<>();
+
+    int cardinalityValue = 5;
+    int strLength = 5;
+
+    String colName = "colInt";
+    dataTypes.put(colName, DataType.INT);

Review comment:
       Apparently this was a test for data generator. Just wanted to add more cases to it.

##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/DataGeneratorSpec.java
##########
@@ -38,6 +38,8 @@
   private final Map<String, Integer> cardinalityMap;
   private final Map<String, IntRange> rangeMap;
   private final Map<String, Map<String, Object>> patternMap;
+  private final Map<String, Double> mvCountMap; // map of column name to average number of values per entry
+  private final Map<String, Integer> lengthMap; // map of column name to average length of th entry (used for string generator)

Review comment:
       Right. Updated it.

##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/MultiValueGeneratorHelper.java
##########
@@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.tools.data.generator;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.function.Supplier;
+
+
+/**
+ * A helper class for generating multi value entries
+ */
+public class MultiValueGeneratorHelper {
+
+  /**
+   * Generate MV entries
+   *
+   * @param numberOfValuesPerEntry number of values per each row
+   * @param rand random object
+   * @param nextItemFunc function to get the next random item
+   * @return
+   */
+  public static List<Object> generateMultiValueEntries(double numberOfValuesPerEntry, Random rand,
+      Supplier<Object> nextItemFunc) {
+    List<Object> entries = new ArrayList<>();
+    int i = 0;
+    for (; i < numberOfValuesPerEntry - 1; i++) {

Review comment:
       I'm using the index `i` in the next statement. I could use `numberOfValuesPerEntry - Math.floor(numberOfValuesPerEntry)` instead of `numberOfValuesPerEntry - i`, but index `i` has exactly that value and we already have it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572452098



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {

Review comment:
       separate schemaFile should not be needed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572431719



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       It is fine to extend that format but hereI feel we are looking at a different format




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578068608



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       That's correct, but even with completed segments, `MemoryEstimator` tries to extrapolate the actual memory size based on secondsToConsume (derived from numHours input) parameter:
   ```java
   int totalDocs = (int) (((double) secondsToConsume / _sampleSegmentConsumedSeconds) * _totalDocsInSampleSegment);
   long memoryForConsumingSegmentPerPartition = getMemoryForConsumingSegmentPerPartition(statsFile, totalDocs);
   ```
   My point is that we have only one completed segment (given or generated) and that should be enough to for estimating memory for different consumption window sizes.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577270646



##########
File path: pinot-tools/src/test/resources/memory_estimation/schema.json
##########
@@ -0,0 +1,36 @@
+{
+  "schemaName": "testTable",
+  "dimensionFieldSpecs": [
+    {

Review comment:
       Okay, I gave up 😃  (To be fair, I wanted to do that in the next PR which is the integration part, but now it has it!)
   Refactored the code to use `SchemaWithMetadata`. Now only one file is provided and that one represents both schema and data characteristics.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572443509



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/MultiValueGeneratorHelper.java
##########
@@ -0,0 +1,54 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.tools.data.generator;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Random;
+import java.util.function.Supplier;
+
+
+/**
+ * A helper class for generating multi value entries
+ */
+public class MultiValueGeneratorHelper {
+
+  /**
+   * Generate MV entries
+   *
+   * @param numberOfValuesPerEntry number of values per each row
+   * @param rand random object
+   * @param nextItemFunc function to get the next random item
+   * @return
+   */
+  public static List<Object> generateMultiValueEntries(double numberOfValuesPerEntry, Random rand,
+      Supplier<Object> nextItemFunc) {
+    List<Object> entries = new ArrayList<>();
+    int i = 0;
+    for (; i < numberOfValuesPerEntry - 1; i++) {

Review comment:
       (nit) I don't think we follow this coding convention. We always initialize the loop variable again in the for loop




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r570621100



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {

Review comment:
       Please make sure that dataCharacteristicsFile format is the exact same json file format that is taken by the recommendation engine (RecommenderDrivr.java). Otherwise, the follow up integration task won't work. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572432670



##########
File path: pinot-tools/src/test/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimatorTest.java
##########
@@ -0,0 +1,85 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.tools.realtime.provisioning;
+
+import java.io.File;
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.testng.annotations.Test;
+
+import static org.testng.Assert.assertEquals;
+
+
+public class MemoryEstimatorTest {
+
+  @Test
+  public void testSegmentGenerator() throws Exception {
+
+    // arrange inputs
+    File schemaFile = readFile("memory_estimation/schema.json");
+    File dataCharacteristicsFile = readFile("memory_estimation/data-characteristics.json");
+    File tableConfigFile = readFile("memory_estimation/table-config.json");
+    TableConfig tableConfig = JsonUtils.fileToObject(tableConfigFile, TableConfig.class);
+
+    // act
+    MemoryEstimator.SegmentGenerator
+        segmentGenerator = new MemoryEstimator.SegmentGenerator(dataCharacteristicsFile, schemaFile, tableConfig, true);
+    File generatedSegment = segmentGenerator.generate();
+
+    // assert
+    Path metadataFile = Paths.get(generatedSegment.getPath(), "v3", "metadata.properties");

Review comment:
       how can we check for average length for string types?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572456267



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -107,6 +129,15 @@ public MemoryEstimator(TableConfig tableConfig, File sampleCompletedSegment, int
     _tableDataDir.mkdir();
   }
 
+  public MemoryEstimator(TableConfig tableConfig, File dataCharacteristicsFile, File schemaFile, int ingestionRate,
+      long maxUsableHostMemory, int tableRetentionHours) {
+    this(tableConfig,
+        generateCompletedSegment(dataCharacteristicsFile, schemaFile, tableConfig),

Review comment:
       Can we add a small test (or ensure manually) that the file location of the generated completed segment returned by generateCompletedSegment() is in the same segment file hierarchy as taken by the other constructor




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] mcvsubbu commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
mcvsubbu commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572467272



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/admin/command/RealtimeProvisioningHelperCommand.java
##########
@@ -71,9 +72,15 @@
   @Option(name = "-numHours", metaVar = "<String>", usage = "number of hours to consume as comma separated values (default 2,3,4,5,6,7,8,9,10,11,12)")
   private String _numHours = "2,3,4,5,6,7,8,9,10,11,12";
 
-  @Option(name = "-sampleCompletedSegmentDir", required = true, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
+  @Option(name = "-sampleCompletedSegmentDir", required = false, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
   private String _sampleCompletedSegmentDir;
 
+  @Option(name = "-dataCharacteristicsFile", required = false, metaVar = "<String>", usage = "File containing characteristics by which a segment will be generated")
+  private String _dataCharacteristicsFile;
+
+  @Option(name = "-schemaFile", required = false, metaVar = "<String>")

Review comment:
       Isnt it good to have the option of providing  a sample segment OR data characteristics?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572452098



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {

Review comment:
       separate schemaFile should not be needed. You can extract the schema out of dataCharacteristics file




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578061948



##########
File path: pinot-common/src/main/java/org/apache/pinot/common/data/schema/SchemaMetadataConstants.java
##########
@@ -0,0 +1,26 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.common.data.schema;
+
+public class SchemaMetadataConstants {

Review comment:
       This should be in RecommenderConstants




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] codecov-io commented on pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#issuecomment-773727645


   # [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=h1) Report
   > Merging [#6546](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=desc) (6b7f6c7) into [master](https://codecov.io/gh/apache/incubator-pinot/commit/1beaab59b73f26c4e35f3b9bc856b03806cddf5a?el=desc) (1beaab5) will **decrease** coverage by `22.59%`.
   > The diff coverage is `41.98%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-pinot/pull/6546/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz)](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #6546       +/-   ##
   ===========================================
   - Coverage   66.44%   43.85%   -22.60%     
   ===========================================
     Files        1075     1335      +260     
     Lines       54773    65807    +11034     
     Branches     8168     9605     +1437     
   ===========================================
   - Hits        36396    28860     -7536     
   - Misses      15700    34514    +18814     
   + Partials     2677     2433      -244     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration | `43.85% <41.98%> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ot/broker/broker/AllowAllAccessControlFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL0FsbG93QWxsQWNjZXNzQ29udHJvbEZhY3RvcnkuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [.../helix/BrokerUserDefinedMessageHandlerFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL2hlbGl4L0Jyb2tlclVzZXJEZWZpbmVkTWVzc2FnZUhhbmRsZXJGYWN0b3J5LmphdmE=) | `52.83% <0.00%> (-13.84%)` | :arrow_down: |
   | [...org/apache/pinot/broker/queryquota/HitCounter.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9IaXRDb3VudGVyLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ache/pinot/broker/queryquota/QueryQuotaEntity.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9RdWVyeVF1b3RhRW50aXR5LmphdmE=) | `0.00% <0.00%> (-50.00%)` | :arrow_down: |
   | [...ker/routing/instanceselector/InstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL0luc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [...ceselector/StrictReplicaGroupInstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL1N0cmljdFJlcGxpY2FHcm91cEluc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/TimeSegmentPruner.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL1RpbWVTZWdtZW50UHJ1bmVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/interval/Interval.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...r/routing/segmentpruner/interval/IntervalTree.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsVHJlZS5qYXZh) | `0.00% <0.00%> (ø)` | |
   | ... and [1343 more](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=footer). Last update [21b83d9...6b7f6c7](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572430013



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       This is not same as what is taken by RecommenderDriver right? I don't think it takes columnCharacteristics.  See the file InputManager and SchemaWithMetadata. It is essentially annotated schema




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] codecov-io edited a comment on pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#issuecomment-773727645


   # [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=h1) Report
   > Merging [#6546](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=desc) (f4c70da) into [master](https://codecov.io/gh/apache/incubator-pinot/commit/1beaab59b73f26c4e35f3b9bc856b03806cddf5a?el=desc) (1beaab5) will **decrease** coverage by `22.51%`.
   > The diff coverage is `41.89%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-pinot/pull/6546/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz)](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #6546       +/-   ##
   ===========================================
   - Coverage   66.44%   43.93%   -22.52%     
   ===========================================
     Files        1075     1345      +270     
     Lines       54773    66038    +11265     
     Branches     8168     9630     +1462     
   ===========================================
   - Hits        36396    29016     -7380     
   - Misses      15700    34584    +18884     
   + Partials     2677     2438      -239     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration | `43.93% <41.89%> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ot/broker/broker/AllowAllAccessControlFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL0FsbG93QWxsQWNjZXNzQ29udHJvbEZhY3RvcnkuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [.../helix/BrokerUserDefinedMessageHandlerFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL2hlbGl4L0Jyb2tlclVzZXJEZWZpbmVkTWVzc2FnZUhhbmRsZXJGYWN0b3J5LmphdmE=) | `52.83% <0.00%> (-13.84%)` | :arrow_down: |
   | [...org/apache/pinot/broker/queryquota/HitCounter.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9IaXRDb3VudGVyLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ache/pinot/broker/queryquota/QueryQuotaEntity.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9RdWVyeVF1b3RhRW50aXR5LmphdmE=) | `0.00% <0.00%> (-50.00%)` | :arrow_down: |
   | [...ker/routing/instanceselector/InstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL0luc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [...ceselector/StrictReplicaGroupInstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL1N0cmljdFJlcGxpY2FHcm91cEluc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/TimeSegmentPruner.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL1RpbWVTZWdtZW50UHJ1bmVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/interval/Interval.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...r/routing/segmentpruner/interval/IntervalTree.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsVHJlZS5qYXZh) | `0.00% <0.00%> (ø)` | |
   | ... and [1360 more](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=footer). Last update [21b83d9...f4c70da](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572440543



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/admin/command/RealtimeProvisioningHelperCommand.java
##########
@@ -71,9 +72,15 @@
   @Option(name = "-numHours", metaVar = "<String>", usage = "number of hours to consume as comma separated values (default 2,3,4,5,6,7,8,9,10,11,12)")
   private String _numHours = "2,3,4,5,6,7,8,9,10,11,12";
 
-  @Option(name = "-sampleCompletedSegmentDir", required = true, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
+  @Option(name = "-sampleCompletedSegmentDir", required = false, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
   private String _sampleCompletedSegmentDir;
 
+  @Option(name = "-dataCharacteristicsFile", required = false, metaVar = "<String>", usage = "File containing characteristics by which a segment will be generated")
+  private String _dataCharacteristicsFile;
+
+  @Option(name = "-schemaFile", required = false, metaVar = "<String>")

Review comment:
       We should not provide schema file. See my comment further below. The data characteristics file is essentially the schema annotated with additional info. So it alone should be enough




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] codecov-io commented on pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#issuecomment-773727645


   # [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=h1) Report
   > Merging [#6546](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=desc) (6b7f6c7) into [master](https://codecov.io/gh/apache/incubator-pinot/commit/1beaab59b73f26c4e35f3b9bc856b03806cddf5a?el=desc) (1beaab5) will **decrease** coverage by `22.59%`.
   > The diff coverage is `41.98%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-pinot/pull/6546/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz)](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #6546       +/-   ##
   ===========================================
   - Coverage   66.44%   43.85%   -22.60%     
   ===========================================
     Files        1075     1335      +260     
     Lines       54773    65807    +11034     
     Branches     8168     9605     +1437     
   ===========================================
   - Hits        36396    28860     -7536     
   - Misses      15700    34514    +18814     
   + Partials     2677     2433      -244     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration | `43.85% <41.98%> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ot/broker/broker/AllowAllAccessControlFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL0FsbG93QWxsQWNjZXNzQ29udHJvbEZhY3RvcnkuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [.../helix/BrokerUserDefinedMessageHandlerFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL2hlbGl4L0Jyb2tlclVzZXJEZWZpbmVkTWVzc2FnZUhhbmRsZXJGYWN0b3J5LmphdmE=) | `52.83% <0.00%> (-13.84%)` | :arrow_down: |
   | [...org/apache/pinot/broker/queryquota/HitCounter.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9IaXRDb3VudGVyLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ache/pinot/broker/queryquota/QueryQuotaEntity.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9RdWVyeVF1b3RhRW50aXR5LmphdmE=) | `0.00% <0.00%> (-50.00%)` | :arrow_down: |
   | [...ker/routing/instanceselector/InstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL0luc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [...ceselector/StrictReplicaGroupInstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL1N0cmljdFJlcGxpY2FHcm91cEluc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/TimeSegmentPruner.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL1RpbWVTZWdtZW50UHJ1bmVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/interval/Interval.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...r/routing/segmentpruner/interval/IntervalTree.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsVHJlZS5qYXZh) | `0.00% <0.00%> (ø)` | |
   | ... and [1343 more](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=footer). Last update [21b83d9...6b7f6c7](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572431719



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       It is fine to extend that format but here I feel l we are looking at a completely different format so integration in the follow up PR won't really work and this will have to be changed again




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia merged pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia merged pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572457056



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       Vals is a hashmap. This is where we are going to run into issues with generation based approach where the characteristics file specifies a very high cardinality for column(s). The HashMap will likely hit OOM. So the other approach of estimating the memory without the completed segment all together by doing all the math becomes quite essential. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572457056



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       Vals is a hashmap. This is where we are going to run into issues with generation based approach where the characteristics file specifies a very high cardinality for column(s). The HashMap will likely hit OOM. So the other approach of estimating the memory without the completed segment all together becomes quite essential. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572512496



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/admin/command/RealtimeProvisioningHelperCommand.java
##########
@@ -71,9 +72,15 @@
   @Option(name = "-numHours", metaVar = "<String>", usage = "number of hours to consume as comma separated values (default 2,3,4,5,6,7,8,9,10,11,12)")
   private String _numHours = "2,3,4,5,6,7,8,9,10,11,12";
 
-  @Option(name = "-sampleCompletedSegmentDir", required = true, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
+  @Option(name = "-sampleCompletedSegmentDir", required = false, metaVar = "<String>", usage = "Consume from the topic for n hours and provide the path of the segment dir after it completes")
   private String _sampleCompletedSegmentDir;
 
+  @Option(name = "-dataCharacteristicsFile", required = false, metaVar = "<String>", usage = "File containing characteristics by which a segment will be generated")
+  private String _dataCharacteristicsFile;
+
+  @Option(name = "-schemaFile", required = false, metaVar = "<String>")

Review comment:
       @mcvsubbu , sorry I don't think I am following this comment? We are providing data characteristics right




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r570621100



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {

Review comment:
       Please make sure that dataCharacteristicsFile format is the exact same json file format that is taken by the recommendation engine (RecommenderDrivr.java). Otherwise, the follow up integration task won't work. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572431719



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       It is fine to extend that format but hereI fee l we are looking at a completely different format




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572433193



##########
File path: pinot-tools/src/test/resources/memory_estimation/schema.json
##########
@@ -0,0 +1,36 @@
+{
+  "schemaName": "testTable",
+  "dimensionFieldSpecs": [
+    {

Review comment:
       Let's make this as comprehensive as possible. 
   
   Include all fixed width - both single and multi-value
   Include all variable width (string and bytes) - both single and multi value




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572442856



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/DataGenerator.java
##########
@@ -163,26 +174,75 @@ private FieldSpec buildSpec(DataGeneratorSpec genSpec, String column) {
 
   public static void main(String[] args)
       throws IOException {
-    final String[] columns = {"column1", "column2", "column3", "column4", "column5"};
+
     final Map<String, DataType> dataTypes = new HashMap<>();
     final Map<String, FieldType> fieldTypes = new HashMap<>();
     final Map<String, TimeUnit> timeUnits = new HashMap<>();
 
     final Map<String, Integer> cardinality = new HashMap<>();
     final Map<String, IntRange> range = new HashMap<>();
     final Map<String, Map<String, Object>> template = new HashMap<>();
-
-    for (final String col : columns) {
-      dataTypes.put(col, DataType.INT);
-      fieldTypes.put(col, FieldType.DIMENSION);
-      cardinality.put(col, 1000);
+    Map<String, Double> mvCountMap = new HashMap<>();
+    Map<String, Integer> lengthMap = new HashMap<>();
+    List<String> columnNames = new ArrayList<>();
+
+    int cardinalityValue = 5;
+    int strLength = 5;
+
+    String colName = "colInt";
+    dataTypes.put(colName, DataType.INT);

Review comment:
       Sorry, I don't follow these changes. All of this information (column name, type etc) should come from the data characteristics file. What is colInt then?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572437744



##########
File path: pinot-tools/src/test/resources/memory_estimation/schema.json
##########
@@ -0,0 +1,36 @@
+{
+  "schemaName": "testTable",
+  "dimensionFieldSpecs": [
+    {

Review comment:
       Also, if you think about this from integration perspective, RecommenderDriver doesn't take schema. It takes the schema annotated with characteristics.  (See the files RecommenderDriver, InputManager and SchemaWithMetadata). So, something to think about for integration. If now you are providing MemoryEstimator with schema separately then I am not sure if you are going to change it back to the annotated schema in the follow-up PR when you will integrate with RecommenderDriver. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578063314



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -107,6 +129,15 @@ public MemoryEstimator(TableConfig tableConfig, File sampleCompletedSegment, int
     _tableDataDir.mkdir();
   }
 
+  public MemoryEstimator(TableConfig tableConfig, File dataCharacteristicsFile, File schemaFile, int ingestionRate,
+      long maxUsableHostMemory, int tableRetentionHours) {
+    this(tableConfig,
+        generateCompletedSegment(dataCharacteristicsFile, schemaFile, tableConfig),

Review comment:
       Just wanted to ensure that the new code (added in this PR) to generated the completed segment generates it in the correct directory hierarchy as expected by the second constructor. Looks like your unit test already covers it. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577250274



##########
File path: pinot-tools/src/test/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimatorTest.java
##########
@@ -0,0 +1,85 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.tools.realtime.provisioning;
+
+import java.io.File;
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.testng.annotations.Test;
+
+import static org.testng.Assert.assertEquals;
+
+
+public class MemoryEstimatorTest {
+
+  @Test
+  public void testSegmentGenerator() throws Exception {
+
+    // arrange inputs
+    File schemaFile = readFile("memory_estimation/schema.json");
+    File dataCharacteristicsFile = readFile("memory_estimation/data-characteristics.json");
+    File tableConfigFile = readFile("memory_estimation/table-config.json");
+    TableConfig tableConfig = JsonUtils.fileToObject(tableConfigFile, TableConfig.class);
+
+    // act
+    MemoryEstimator.SegmentGenerator
+        segmentGenerator = new MemoryEstimator.SegmentGenerator(dataCharacteristicsFile, schemaFile, tableConfig, true);
+    File generatedSegment = segmentGenerator.generate();
+
+    // assert
+    Path metadataFile = Paths.get(generatedSegment.getPath(), "v3", "metadata.properties");

Review comment:
       I'm not sure! Do you have any suggestion? Although StringGenerator is a simple class and we're ok not checking that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572458620



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/NumberGenerator.java
##########
@@ -61,23 +66,23 @@ public void init() {
         final int start = rand.nextInt(cardinality);
         final int end = start + cardinality;
         for (int i = start; i < end; i++) {
-          intValues.add(new Integer(i));
+          intValues.add(i);
         }
         break;
       case LONG:
         longValues = new ArrayList<Long>();
         final long longStart = rand.nextInt(cardinality);
         final long longEnd = longStart + cardinality;
         for (long i = longStart; i < longEnd; i++) {
-          longValues.add(new Long(i));
+          longValues.add(i);
         }
         break;
       case FLOAT:
         floatValues = new ArrayList<Float>();
-        final float floatStart = rand.nextFloat() * rand.nextInt(1000);
+        final float floatStart = Math.round(rand.nextFloat()* 100.0f) / 100.0f;

Review comment:
       why this change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572432670



##########
File path: pinot-tools/src/test/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimatorTest.java
##########
@@ -0,0 +1,85 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.pinot.tools.realtime.provisioning;
+
+import java.io.File;
+import java.net.URL;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.nio.file.Paths;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import org.apache.commons.io.FileUtils;
+import org.apache.pinot.spi.config.table.TableConfig;
+import org.apache.pinot.spi.utils.JsonUtils;
+import org.testng.annotations.Test;
+
+import static org.testng.Assert.assertEquals;
+
+
+public class MemoryEstimatorTest {
+
+  @Test
+  public void testSegmentGenerator() throws Exception {
+
+    // arrange inputs
+    File schemaFile = readFile("memory_estimation/schema.json");
+    File dataCharacteristicsFile = readFile("memory_estimation/data-characteristics.json");
+    File tableConfigFile = readFile("memory_estimation/table-config.json");
+    TableConfig tableConfig = JsonUtils.fileToObject(tableConfigFile, TableConfig.class);
+
+    // act
+    MemoryEstimator.SegmentGenerator
+        segmentGenerator = new MemoryEstimator.SegmentGenerator(dataCharacteristicsFile, schemaFile, tableConfig, true);
+    File generatedSegment = segmentGenerator.generate();
+
+    // assert
+    Path metadataFile = Paths.get(generatedSegment.getPath(), "v3", "metadata.properties");

Review comment:
       how can we check for average length for string and bytes types?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572453667



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       https://github.com/apache/incubator-pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/recommender/io/InputManager.java
   
   https://github.com/apache/incubator-pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/recommender/io/metadata/SchemaWithMetaData.java
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572433193



##########
File path: pinot-tools/src/test/resources/memory_estimation/schema.json
##########
@@ -0,0 +1,36 @@
+{
+  "schemaName": "testTable",
+  "dimensionFieldSpecs": [
+    {

Review comment:
       Let's make this as complete as possible. 
   
   All fixed width - both single and multi-value
   All variable width (string and bytes) - both single and multi value




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572451751



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -107,6 +129,15 @@ public MemoryEstimator(TableConfig tableConfig, File sampleCompletedSegment, int
     _tableDataDir.mkdir();
   }
 
+  public MemoryEstimator(TableConfig tableConfig, File dataCharacteristicsFile, File schemaFile, int ingestionRate,

Review comment:
       (nit) Add small javadoc please to highlight how this constructor is different from the other one




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572446070



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {
+    SegmentGenerator segmentGenerator = new SegmentGenerator(dataCharacteristicsFile, schemaFile, tableConfig, true);
+    return segmentGenerator.generate();
+  }
+
+  /**
+   * This class is used in Memory Estimator to generate segment based on the the given characteristics of data
+   */
+  public static class SegmentGenerator {
+    private static final Logger LOGGER = LoggerFactory.getLogger(SegmentGenerator.class);
+    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd_HH:mm:ss");
+
+    private File _dataCharacteristicsFile;
+    private File _schemaFile;
+    private TableConfig _tableConfig;
+    private final boolean _deleteCsv;
+
+    public SegmentGenerator(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig, boolean deleteCsv) {
+      _dataCharacteristicsFile = dataCharacteristicsFile;
+      _schemaFile = schemaFile;
+      _tableConfig = tableConfig;
+      _deleteCsv = deleteCsv;
+    }
+
+    public File generate() {
+      Date now = new Date();
+
+      // extract schema
+      Schema schema;
+      try {
+        schema = JsonUtils.fileToObject(_schemaFile, Schema.class);
+      } catch (Exception e) {
+        throw new RuntimeException(String.format("Cannot read schema file '%s' to schema object.", _schemaFile));
+      }
+
+      // generate data & creat segment
+      File csvDataFile = generateData(schema, now);
+      File segment = createSegment(csvDataFile, schema, now);
+
+      if (_deleteCsv) {
+        csvDataFile.delete();
+      }
+      return segment;
+    }
+
+    private File generateData(Schema schema, Date now) {
+
+      // extract data characteristics
+      DataCharacteristics dataCharacteristics;
+      try {
+         dataCharacteristics = JsonUtils.fileToObject(_dataCharacteristicsFile, DataCharacteristics.class);
+      } catch (Exception e) {
+        throw new RuntimeException("Cannot deserialize data characteristic file " + _dataCharacteristicsFile);
+      }
+
+      // create maps of "column name" to ...
+      Map<String, Integer> lengths = new HashMap<>();
+      Map<String, Double> mvCounts = new HashMap<>();
+      Map<String, Integer> cardinalities = new HashMap<>();
+      Map<String, FieldSpec.DataType> dataTypes = new HashMap<>();
+      Map<String, FieldSpec.FieldType> fieldTypes = new HashMap<>();
+      Map<String, TimeUnit> timeUnits = new HashMap<>();
+      List<String> colNames = new ArrayList<>();
+      dataCharacteristics.columnCharacteristics.forEach(column -> {
+        colNames.add(column.name);
+        lengths.put(column.name, column.averageLength);
+        mvCounts.put(column.name, column.numberOfValuesPerEntry);
+        cardinalities.put(column.name, column.cardinality);
+      });
+      schema.getAllFieldSpecs().forEach(fieldSpec -> {
+        String name = fieldSpec.getName();
+        dataTypes.put(name, fieldSpec.getDataType());
+        fieldTypes.put(name, fieldSpec.getFieldType());
+      });
+      timeUnits.put(schema.getTimeFieldSpec().getName(),
+          schema.getTimeFieldSpec().getIncomingGranularitySpec().getTimeType());
+
+      // generate data
+      String outputDir = getOutputDir(now, "-csv");
+      DataGeneratorSpec spec =
+          new DataGeneratorSpec(colNames, cardinalities, new HashMap<>(), new HashMap<>(), mvCounts, lengths, dataTypes,
+              fieldTypes, timeUnits, FileFormat.CSV, outputDir, true);
+      DataGenerator dataGenerator = new DataGenerator();
+      try {
+        dataGenerator.init(spec);
+        dataGenerator.generateCsv(dataCharacteristics.numberOfRows, 1);
+        File outputFile = Paths.get(outputDir, "output_0.csv").toFile();
+        LOGGER.info("Successfully generated data file: {}", outputFile);
+        return outputFile;
+      } catch (Exception e) {
+        throw new RuntimeException(e);
+      }
+    }
+
+    private File createSegment(File csvDataFile, Schema schema, Date now) {
+
+      // create segment
+      LOGGER.info("Started creating segment from file: {}", csvDataFile);
+      String outDir = getOutputDir(now, "-segment");
+      SegmentGeneratorConfig segmentGeneratorConfig = getSegmentGeneratorConfig(csvDataFile, schema, outDir);
+      SegmentIndexCreationDriver driver = new SegmentIndexCreationDriverImpl();
+      try {
+        driver.init(segmentGeneratorConfig);
+        driver.build();
+      } catch (Exception e) {
+        throw new RuntimeException("Caught exception while generating segment from file: " + csvDataFile, e);
+      }
+      String segmentName = driver.getSegmentName();
+      File indexDir = new File(outDir, segmentName);
+      LOGGER.info("Successfully created segment: {} at directory: {}", segmentName, indexDir);
+
+      // verify segment
+      LOGGER.info("Verifying the segment by loading it");
+      ImmutableSegment segment;
+      try {
+        segment = ImmutableSegmentLoader.load(indexDir, ReadMode.mmap);
+      } catch (Exception e) {
+        throw new RuntimeException("Caught exception while verifying the created segment", e);
+      }
+      LOGGER.info("Successfully loaded segment: {} of size: {} bytes", segmentName,
+          segment.getSegmentSizeBytes());
+      segment.destroy();
+
+      return indexDir;
+    }
+
+    private SegmentGeneratorConfig getSegmentGeneratorConfig(File csvDataFile, Schema schema, String outDir) {
+      SegmentGeneratorConfig segmentGeneratorConfig = new SegmentGeneratorConfig(_tableConfig, schema);
+      segmentGeneratorConfig.setInputFilePath(csvDataFile.getPath());
+      segmentGeneratorConfig.setFormat(FileFormat.CSV);
+      segmentGeneratorConfig.setOutDir(outDir);
+      segmentGeneratorConfig.setReaderConfig(new CSVRecordReaderConfig()); // FIXME

Review comment:
       Why FIXME?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578063314



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -107,6 +129,15 @@ public MemoryEstimator(TableConfig tableConfig, File sampleCompletedSegment, int
     _tableDataDir.mkdir();
   }
 
+  public MemoryEstimator(TableConfig tableConfig, File dataCharacteristicsFile, File schemaFile, int ingestionRate,
+      long maxUsableHostMemory, int tableRetentionHours) {
+    this(tableConfig,
+        generateCompletedSegment(dataCharacteristicsFile, schemaFile, tableConfig),

Review comment:
       Just wanted to ensure that the new code (added in this PR) to generated the completed segment generates it in the correct directory hierarchy as expected by the second constructor. Looks like your unit test already covers it. 
   Until now we were providing a pointer to completed segment dir. So wanted to ensure that the code generating completed segment in this PR follows the directory hierarchy. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] codecov-io edited a comment on pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#issuecomment-773727645


   # [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=h1) Report
   > Merging [#6546](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=desc) (c59c821) into [master](https://codecov.io/gh/apache/incubator-pinot/commit/1beaab59b73f26c4e35f3b9bc856b03806cddf5a?el=desc) (1beaab5) will **decrease** coverage by `22.57%`.
   > The diff coverage is `41.68%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/incubator-pinot/pull/6546/graphs/tree.svg?width=650&height=150&src=pr&token=4ibza2ugkz)](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff             @@
   ##           master    #6546       +/-   ##
   ===========================================
   - Coverage   66.44%   43.87%   -22.58%     
   ===========================================
     Files        1075     1347      +272     
     Lines       54773    66151    +11378     
     Branches     8168     9643     +1475     
   ===========================================
   - Hits        36396    29022     -7374     
   - Misses      15700    34680    +18980     
   + Partials     2677     2449      -228     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration | `43.87% <41.68%> (?)` | |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [...ot/broker/broker/AllowAllAccessControlFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL0FsbG93QWxsQWNjZXNzQ29udHJvbEZhY3RvcnkuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [.../helix/BrokerUserDefinedMessageHandlerFactory.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvYnJva2VyL2hlbGl4L0Jyb2tlclVzZXJEZWZpbmVkTWVzc2FnZUhhbmRsZXJGYWN0b3J5LmphdmE=) | `52.83% <0.00%> (-13.84%)` | :arrow_down: |
   | [...org/apache/pinot/broker/queryquota/HitCounter.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9IaXRDb3VudGVyLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | [...che/pinot/broker/queryquota/MaxHitRateTracker.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9NYXhIaXRSYXRlVHJhY2tlci5qYXZh) | `0.00% <0.00%> (ø)` | |
   | [...ache/pinot/broker/queryquota/QueryQuotaEntity.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcXVlcnlxdW90YS9RdWVyeVF1b3RhRW50aXR5LmphdmE=) | `0.00% <0.00%> (-50.00%)` | :arrow_down: |
   | [...ker/routing/instanceselector/InstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL0luc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `100.00% <ø> (ø)` | |
   | [...ceselector/StrictReplicaGroupInstanceSelector.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9pbnN0YW5jZXNlbGVjdG9yL1N0cmljdFJlcGxpY2FHcm91cEluc3RhbmNlU2VsZWN0b3IuamF2YQ==) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/TimeSegmentPruner.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL1RpbWVTZWdtZW50UHJ1bmVyLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...roker/routing/segmentpruner/interval/Interval.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsLmphdmE=) | `0.00% <0.00%> (ø)` | |
   | [...r/routing/segmentpruner/interval/IntervalTree.java](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree#diff-cGlub3QtYnJva2VyL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9waW5vdC9icm9rZXIvcm91dGluZy9zZWdtZW50cHJ1bmVyL2ludGVydmFsL0ludGVydmFsVHJlZS5qYXZh) | `0.00% <0.00%> (ø)` | |
   | ... and [1360 more](https://codecov.io/gh/apache/incubator-pinot/pull/6546/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=footer). Last update [21b83d9...c59c821](https://codecov.io/gh/apache/incubator-pinot/pull/6546?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572457056



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       Vals is a hashmap. This is where we are going to run into issues with generation based approach where the characteristics file specifies a very high cardinality for column(s). The HashMap will likely hit OOM




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578055491



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/StringGenerator.java
##########
@@ -56,11 +65,18 @@ public void init() {
 
   @Override
   public Object next() {
+    if (numberOfValuesPerEntry == 1) {
+      return getNextString();
+    }
+    return MultiValueGeneratorHelper.generateMultiValueEntries(numberOfValuesPerEntry, rand, this::getNextString);
+  }
+
+  private String getNextString() {
     return vals.get(rand.nextInt(cardinality));

Review comment:
       It is not unusual to have million rows in a completed segment. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578062961



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/data/generator/NumberGenerator.java
##########
@@ -61,23 +66,23 @@ public void init() {
         final int start = rand.nextInt(cardinality);
         final int end = start + cardinality;
         for (int i = start; i < end; i++) {
-          intValues.add(new Integer(i));
+          intValues.add(i);
         }
         break;
       case LONG:
         longValues = new ArrayList<Long>();
         final long longStart = rand.nextInt(cardinality);
         final long longEnd = longStart + cardinality;
         for (long i = longStart; i < longEnd; i++) {
-          longValues.add(new Long(i));
+          longValues.add(i);
         }
         break;
       case FLOAT:
         floatValues = new ArrayList<Float>();
-        final float floatStart = rand.nextFloat() * rand.nextInt(1000);
+        final float floatStart = Math.round(rand.nextFloat()* 100.0f) / 100.0f;

Review comment:
       Got it. Thanks




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572430013



##########
File path: pinot-tools/src/test/resources/memory_estimation/data-characteristics.json
##########
@@ -0,0 +1,37 @@
+{
+  "numberOfRows": 10000,
+  "columnCharacteristics": [

Review comment:
       This is not same as what is taken by RecommenderDriver right? I don't think it takes columnCharacteristics.  See the file InputManager and SchemaWithMetadata. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] sajjad-moradi commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
sajjad-moradi commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r577250920



##########
File path: pinot-tools/src/main/java/org/apache/pinot/tools/realtime/provisioning/MemoryEstimator.java
##########
@@ -387,4 +418,182 @@ private long calculateMemoryForCompletedSegmentsPerPartition(long completedSegme
   public String[][] getNumSegmentsQueriedPerHost() {
     return _numSegmentsQueriedPerHost;
   }
+
+  private static File generateCompletedSegment(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig) {
+    SegmentGenerator segmentGenerator = new SegmentGenerator(dataCharacteristicsFile, schemaFile, tableConfig, true);
+    return segmentGenerator.generate();
+  }
+
+  /**
+   * This class is used in Memory Estimator to generate segment based on the the given characteristics of data
+   */
+  public static class SegmentGenerator {
+    private static final Logger LOGGER = LoggerFactory.getLogger(SegmentGenerator.class);
+    private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyy-MM-dd_HH:mm:ss");
+
+    private File _dataCharacteristicsFile;
+    private File _schemaFile;
+    private TableConfig _tableConfig;
+    private final boolean _deleteCsv;
+
+    public SegmentGenerator(File dataCharacteristicsFile, File schemaFile, TableConfig tableConfig, boolean deleteCsv) {
+      _dataCharacteristicsFile = dataCharacteristicsFile;
+      _schemaFile = schemaFile;
+      _tableConfig = tableConfig;
+      _deleteCsv = deleteCsv;
+    }
+
+    public File generate() {
+      Date now = new Date();
+
+      // extract schema
+      Schema schema;
+      try {
+        schema = JsonUtils.fileToObject(_schemaFile, Schema.class);
+      } catch (Exception e) {
+        throw new RuntimeException(String.format("Cannot read schema file '%s' to schema object.", _schemaFile));
+      }
+
+      // generate data & creat segment
+      File csvDataFile = generateData(schema, now);
+      File segment = createSegment(csvDataFile, schema, now);
+
+      if (_deleteCsv) {
+        csvDataFile.delete();
+      }
+      return segment;
+    }
+
+    private File generateData(Schema schema, Date now) {
+
+      // extract data characteristics
+      DataCharacteristics dataCharacteristics;
+      try {
+         dataCharacteristics = JsonUtils.fileToObject(_dataCharacteristicsFile, DataCharacteristics.class);
+      } catch (Exception e) {
+        throw new RuntimeException("Cannot deserialize data characteristic file " + _dataCharacteristicsFile);
+      }
+
+      // create maps of "column name" to ...
+      Map<String, Integer> lengths = new HashMap<>();
+      Map<String, Double> mvCounts = new HashMap<>();
+      Map<String, Integer> cardinalities = new HashMap<>();
+      Map<String, FieldSpec.DataType> dataTypes = new HashMap<>();
+      Map<String, FieldSpec.FieldType> fieldTypes = new HashMap<>();
+      Map<String, TimeUnit> timeUnits = new HashMap<>();
+      List<String> colNames = new ArrayList<>();
+      dataCharacteristics.columnCharacteristics.forEach(column -> {
+        colNames.add(column.name);
+        lengths.put(column.name, column.averageLength);
+        mvCounts.put(column.name, column.numberOfValuesPerEntry);
+        cardinalities.put(column.name, column.cardinality);
+      });
+      schema.getAllFieldSpecs().forEach(fieldSpec -> {
+        String name = fieldSpec.getName();
+        dataTypes.put(name, fieldSpec.getDataType());
+        fieldTypes.put(name, fieldSpec.getFieldType());
+      });
+      timeUnits.put(schema.getTimeFieldSpec().getName(),
+          schema.getTimeFieldSpec().getIncomingGranularitySpec().getTimeType());
+
+      // generate data
+      String outputDir = getOutputDir(now, "-csv");
+      DataGeneratorSpec spec =
+          new DataGeneratorSpec(colNames, cardinalities, new HashMap<>(), new HashMap<>(), mvCounts, lengths, dataTypes,
+              fieldTypes, timeUnits, FileFormat.CSV, outputDir, true);
+      DataGenerator dataGenerator = new DataGenerator();
+      try {
+        dataGenerator.init(spec);
+        dataGenerator.generateCsv(dataCharacteristics.numberOfRows, 1);
+        File outputFile = Paths.get(outputDir, "output_0.csv").toFile();
+        LOGGER.info("Successfully generated data file: {}", outputFile);
+        return outputFile;
+      } catch (Exception e) {
+        throw new RuntimeException(e);
+      }
+    }
+
+    private File createSegment(File csvDataFile, Schema schema, Date now) {
+
+      // create segment
+      LOGGER.info("Started creating segment from file: {}", csvDataFile);
+      String outDir = getOutputDir(now, "-segment");
+      SegmentGeneratorConfig segmentGeneratorConfig = getSegmentGeneratorConfig(csvDataFile, schema, outDir);
+      SegmentIndexCreationDriver driver = new SegmentIndexCreationDriverImpl();
+      try {
+        driver.init(segmentGeneratorConfig);
+        driver.build();
+      } catch (Exception e) {
+        throw new RuntimeException("Caught exception while generating segment from file: " + csvDataFile, e);
+      }
+      String segmentName = driver.getSegmentName();
+      File indexDir = new File(outDir, segmentName);
+      LOGGER.info("Successfully created segment: {} at directory: {}", segmentName, indexDir);
+
+      // verify segment
+      LOGGER.info("Verifying the segment by loading it");
+      ImmutableSegment segment;
+      try {
+        segment = ImmutableSegmentLoader.load(indexDir, ReadMode.mmap);
+      } catch (Exception e) {
+        throw new RuntimeException("Caught exception while verifying the created segment", e);
+      }
+      LOGGER.info("Successfully loaded segment: {} of size: {} bytes", segmentName,
+          segment.getSegmentSizeBytes());
+      segment.destroy();
+
+      return indexDir;
+    }
+
+    private SegmentGeneratorConfig getSegmentGeneratorConfig(File csvDataFile, Schema schema, String outDir) {
+      SegmentGeneratorConfig segmentGeneratorConfig = new SegmentGeneratorConfig(_tableConfig, schema);
+      segmentGeneratorConfig.setInputFilePath(csvDataFile.getPath());
+      segmentGeneratorConfig.setFormat(FileFormat.CSV);
+      segmentGeneratorConfig.setOutDir(outDir);
+      segmentGeneratorConfig.setReaderConfig(new CSVRecordReaderConfig()); // FIXME

Review comment:
       It was for my book-keeping that got away to get cleaned up before sending the PR. It's removed now.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r572437744



##########
File path: pinot-tools/src/test/resources/memory_estimation/schema.json
##########
@@ -0,0 +1,36 @@
+{
+  "schemaName": "testTable",
+  "dimensionFieldSpecs": [
+    {

Review comment:
       Also, if you think about this from integration perspective, RecommenderDriver doesn't take schema. It takes the schema annotated with characteristics.  (See the files RecommenderDriver, InputManager and SchemaWithMetadata). So, something to think about for integration.
   
   If now you are providing MemoryEstimator with schema separately then I am not sure if you are going to change it back to the annotated schema in the follow-up PR when you will integrate with RecommenderDriver.  From the annotated file, you should be able to extract out the schema easily. The API will look much cleaner as compared to now with separate schema and characteristics files. That's the reason why RecommenderDriver just takes one file. 
   
   https://github.com/apache/incubator-pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/recommender/io/InputManager.java
   
   https://github.com/apache/incubator-pinot/blob/master/pinot-controller/src/main/java/org/apache/pinot/controller/recommender/io/metadata/SchemaWithMetaData.java




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [incubator-pinot] siddharthteotia commented on a change in pull request #6546: Improve Real Time Provisioning Helper tool

Posted by GitBox <gi...@apache.org>.
siddharthteotia commented on a change in pull request #6546:
URL: https://github.com/apache/incubator-pinot/pull/6546#discussion_r578061642



##########
File path: pinot-common/src/main/java/org/apache/pinot/common/data/schema/FieldMetadata.java
##########
@@ -16,24 +16,24 @@
  * specific language governing permissions and limitations
  * under the License.
  */
-package org.apache.pinot.controller.recommender.io.metadata;
+package org.apache.pinot.common.data.schema;
 
 import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
 import com.fasterxml.jackson.annotation.JsonSetter;
 import com.fasterxml.jackson.annotation.Nulls;
 import org.apache.pinot.spi.data.FieldSpec;
 
-import static org.apache.pinot.controller.recommender.rules.io.params.RecommenderConstants.DEFAULT_AVERAGE_NUM_VALUES_PER_ENTRY;
-import static org.apache.pinot.controller.recommender.rules.io.params.RecommenderConstants.DEFAULT_CARDINALITY;
-import static org.apache.pinot.controller.recommender.rules.io.params.RecommenderConstants.DEFAULT_DATA_LENGTH;
+import static org.apache.pinot.common.data.schema.SchemaMetadataConstants.DEFAULT_AVERAGE_NUM_VALUES_PER_ENTRY;

Review comment:
       It is preferable to have all the recommender code in a sub-directory as it is currently organized in pinot/controller/recommender/...
   
   So, please consider undoing this directory move




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org