You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pinot.apache.org by GitBox <gi...@apache.org> on 2022/06/07 00:15:41 UTC

[GitHub] [pinot] Jackie-Jiang opened a new pull request, #8846: Reduce the heap memory usage for segment creation

Jackie-Jiang opened a new pull request, #8846:
URL: https://github.com/apache/pinot/pull/8846

   During the segment creation, currently we hold multiple copies of unique values for all columns in heap memory (3 copies for dictionary encoded column, 2 copies for raw index column).
   This PR enhances the `StatsCollector` to early release one copy when the values are no longer needed, and remove the reference of one copy of the unique values from the dictionary creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] codecov-commenter commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1148094810

   # [Codecov](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#8846](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (4f338c8) into [master](https://codecov.io/gh/apache/pinot/commit/d1d2ddbc116f6dddbefdf1e69dd409434eaae0e4?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d1d2ddb) will **decrease** coverage by `6.87%`.
   > The diff coverage is `92.25%`.
   
   > :exclamation: Current head 4f338c8 differs from pull request most recent head b2cad00. Consider uploading reports for the commit b2cad00 to get more accurate results
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #8846      +/-   ##
   ============================================
   - Coverage     69.80%   62.92%   -6.88%     
   + Complexity     4659     4647      -12     
   ============================================
     Files          1741     1695      -46     
     Lines         91476    89470    -2006     
     Branches      13677    13457     -220     
   ============================================
   - Hits          63854    56301    -7553     
   - Misses        23199    29111    +5912     
   + Partials       4423     4058     -365     
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | integration1 | `?` | |
   | integration2 | `?` | |
   | unittests1 | `66.24% <92.25%> (+0.02%)` | :arrow_up: |
   | unittests2 | `14.16% <0.00%> (+0.03%)` | :arrow_up: |
   
   Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [...t/creator/impl/SegmentIndexCreationDriverImpl.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50SW5kZXhDcmVhdGlvbkRyaXZlckltcGwuamF2YQ==) | `84.61% <ø> (ø)` | |
   | [.../stats/BigDecimalColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9CaWdEZWNpbWFsQ29sdW1uUHJlSW5kZXhTdGF0c0NvbGxlY3Rvci5qYXZh) | `69.04% <83.33%> (-0.96%)` | :arrow_down: |
   | [...impl/stats/BytesColumnPredIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9CeXRlc0NvbHVtblByZWRJbmRleFN0YXRzQ29sbGVjdG9yLmphdmE=) | `79.62% <83.33%> (-1.14%)` | :arrow_down: |
   | [...impl/stats/StringColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9TdHJpbmdDb2x1bW5QcmVJbmRleFN0YXRzQ29sbGVjdG9yLmphdmE=) | `79.24% <83.33%> (-1.15%)` | :arrow_down: |
   | [...impl/stats/DoubleColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9Eb3VibGVDb2x1bW5QcmVJbmRleFN0YXRzQ29sbGVjdG9yLmphdmE=) | `78.57% <87.50%> (-1.43%)` | :arrow_down: |
   | [.../impl/stats/FloatColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9GbG9hdENvbHVtblByZUluZGV4U3RhdHNDb2xsZWN0b3IuamF2YQ==) | `78.57% <87.50%> (-1.43%)` | :arrow_down: |
   | [...or/impl/stats/IntColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9JbnRDb2x1bW5QcmVJbmRleFN0YXRzQ29sbGVjdG9yLmphdmE=) | `78.57% <87.50%> (-1.43%)` | :arrow_down: |
   | [...r/impl/stats/LongColumnPreIndexStatsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9Mb25nQ29sdW1uUHJlSW5kZXhTdGF0c0NvbGxlY3Rvci5qYXZh) | `78.57% <87.50%> (-1.43%)` | :arrow_down: |
   | [.../impl/stats/AbstractColumnStatisticsCollector.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9zdGF0cy9BYnN0cmFjdENvbHVtblN0YXRpc3RpY3NDb2xsZWN0b3IuamF2YQ==) | `92.85% <93.33%> (ø)` | |
   | [...segment/creator/impl/SegmentDictionaryCreator.java](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGlub3Qtc2VnbWVudC1sb2NhbC9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvcGlub3Qvc2VnbWVudC9sb2NhbC9zZWdtZW50L2NyZWF0b3IvaW1wbC9TZWdtZW50RGljdGlvbmFyeUNyZWF0b3IuamF2YQ==) | `90.00% <95.65%> (-1.12%)` | :arrow_down: |
   | ... and [421 more](https://codecov.io/gh/apache/pinot/pull/8846/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [d1d2ddb...b2cad00](https://codecov.io/gh/apache/pinot/pull/8846?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r901025316


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -100,8 +97,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toLongArray();
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {

Review Comment:
   I don't want to introduce an extra per-value check (other checks are per column). IMO it is okay for it to throw NPE if someone uses it incorrectly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1164628897

   Just for the sake of posterity, in case the tradeoff is discussed again, the replacement of the hashmap lookups with binary search change was implemented in #8924 and reverted in #8950.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r901020485


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -100,8 +97,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toLongArray();
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {

Review Comment:
   If there is no sorting after seal, we should also throw an exception in collect(Object) if _sealed=true to avoid any unexpected behavior. Same for all *ColumnPreIndexStatsCollector



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r899758406


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -57,10 +56,10 @@ public void collect(Object entry) {
     }
   }
 
-  void addressSorted(long entry) {
-    if (_isSorted) {
+  private void addressSorted(long entry) {

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1158566486

   Performed a benchmark with 5M lookups on different data types with different cardinality:
   
   ```
   Benchmark                                          (_cardinality)  Mode  Cnt     Score      Error  Units
   BenchmarkBinarySearch.benchmarkDoubleBinarySearch             100  avgt    3   216.168 ±   17.112  ms/op
   BenchmarkBinarySearch.benchmarkDoubleBinarySearch           10000  avgt    3   402.108 ±   37.544  ms/op
   BenchmarkBinarySearch.benchmarkDoubleBinarySearch         1000000  avgt    3   831.683 ±  417.802  ms/op
   BenchmarkBinarySearch.benchmarkDoubleMap                      100  avgt    3    65.926 ±   21.608  ms/op
   BenchmarkBinarySearch.benchmarkDoubleMap                    10000  avgt    3    87.987 ±    7.291  ms/op
   BenchmarkBinarySearch.benchmarkDoubleMap                  1000000  avgt    3   568.169 ±   87.005  ms/op
   BenchmarkBinarySearch.benchmarkFloatBinarySearch              100  avgt    3   203.972 ±    8.662  ms/op
   BenchmarkBinarySearch.benchmarkFloatBinarySearch            10000  avgt    3   393.657 ±   49.358  ms/op
   BenchmarkBinarySearch.benchmarkFloatBinarySearch          1000000  avgt    3   751.277 ±   13.846  ms/op
   BenchmarkBinarySearch.benchmarkFloatMap                       100  avgt    3    89.837 ±   24.091  ms/op
   BenchmarkBinarySearch.benchmarkFloatMap                     10000  avgt    3    87.869 ±   17.819  ms/op
   BenchmarkBinarySearch.benchmarkFloatMap                   1000000  avgt    3   404.022 ±   56.542  ms/op
   BenchmarkBinarySearch.benchmarkIntBinarySearch                100  avgt    3   191.057 ±   24.173  ms/op
   BenchmarkBinarySearch.benchmarkIntBinarySearch              10000  avgt    3   345.814 ±   78.147  ms/op
   BenchmarkBinarySearch.benchmarkIntBinarySearch            1000000  avgt    3   693.074 ±   23.619  ms/op
   BenchmarkBinarySearch.benchmarkIntMap                         100  avgt    3    81.485 ±    1.535  ms/op
   BenchmarkBinarySearch.benchmarkIntMap                       10000  avgt    3    85.403 ±    7.045  ms/op
   BenchmarkBinarySearch.benchmarkIntMap                     1000000  avgt    3   219.425 ±   59.368  ms/op
   BenchmarkBinarySearch.benchmarkLongBinarySearch               100  avgt    3   182.610 ±   13.995  ms/op
   BenchmarkBinarySearch.benchmarkLongBinarySearch             10000  avgt    3   365.881 ±   15.217  ms/op
   BenchmarkBinarySearch.benchmarkLongBinarySearch           1000000  avgt    3   774.414 ±  402.259  ms/op
   BenchmarkBinarySearch.benchmarkLongMap                        100  avgt    3    65.400 ±    5.087  ms/op
   BenchmarkBinarySearch.benchmarkLongMap                      10000  avgt    3    98.427 ±   11.991  ms/op
   BenchmarkBinarySearch.benchmarkLongMap                    1000000  avgt    3   302.917 ±    6.348  ms/op
   BenchmarkBinarySearch.benchmarkStringBinarySearch             100  avgt    3   306.576 ±   39.766  ms/op
   BenchmarkBinarySearch.benchmarkStringBinarySearch           10000  avgt    3   870.486 ±   52.621  ms/op
   BenchmarkBinarySearch.benchmarkStringBinarySearch         1000000  avgt    3  3173.214 ± 1122.761  ms/op
   BenchmarkBinarySearch.benchmarkStringMap                      100  avgt    3   115.433 ±    4.070  ms/op
   BenchmarkBinarySearch.benchmarkStringMap                    10000  avgt    3   254.615 ±   15.829  ms/op
   BenchmarkBinarySearch.benchmarkStringMap                  1000000  avgt    3  1372.186 ±  197.025  ms/op
   ```
   
   Note that all the strings have the same prefix, so it is actually adding some comparison cost for the binary search. With 1M cardinality, binary search is about 2-3x slower than hash lookup, but it is still fast enough (adding less than 2 seconds for string), and shouldn't become an issue. Considering the memory saving (especially preventing running out of memory when generating segments for wide tables), and the extra GC overhead of creating a map, this change should be beneficial. 
   
   Here is the benchmark code:
   ```
   package org.apache.pinot.perf;
   
   import it.unimi.dsi.fastutil.doubles.Double2IntOpenHashMap;
   import it.unimi.dsi.fastutil.floats.Float2IntOpenHashMap;
   import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;
   import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap;
   import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
   import java.io.IOException;
   import java.util.Arrays;
   import java.util.Random;
   import java.util.concurrent.TimeUnit;
   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.BenchmarkMode;
   import org.openjdk.jmh.annotations.Mode;
   import org.openjdk.jmh.annotations.OutputTimeUnit;
   import org.openjdk.jmh.annotations.Param;
   import org.openjdk.jmh.annotations.Scope;
   import org.openjdk.jmh.annotations.Setup;
   import org.openjdk.jmh.annotations.State;
   import org.openjdk.jmh.runner.Runner;
   import org.openjdk.jmh.runner.options.Options;
   import org.openjdk.jmh.runner.options.OptionsBuilder;
   import org.openjdk.jmh.runner.options.TimeValue;
   
   
   @State(Scope.Benchmark)
   public class BenchmarkBinarySearch {
     private static final int NUM_LOOK_UPS = 5000000;
     private static final Random RANDOM = new Random();
   
     @Param({"100", "10000", "1000000"})
     private int _cardinality;
   
     private int[] _sortedInts;
     private long[] _sortedLongs;
     private float[] _sortedFloats;
     private double[] _sortedDoubles;
     private String[] _sortedStrings;
   
     @Setup
     public void setUp()
         throws IOException {
       _sortedInts = new int[_cardinality];
       _sortedLongs = new long[_cardinality];
       _sortedFloats = new float[_cardinality];
       _sortedDoubles = new double[_cardinality];
       _sortedStrings = new String[_cardinality];
       for (int i = 0; i < _cardinality; i++) {
         _sortedInts[i] = i;
         _sortedLongs[i] = i;
         _sortedFloats[i] = i;
         _sortedDoubles[i] = i;
         String string = Integer.toString(i + _cardinality);
         _sortedStrings[i] = string;
       }
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkIntMap() {
       Int2IntOpenHashMap intMap = new Int2IntOpenHashMap(_cardinality);
       for (int i = 0; i < _cardinality; i++) {
         intMap.put(i, i);
       }
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += intMap.get(RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkIntBinarySearch() {
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += Arrays.binarySearch(_sortedInts, RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkLongMap() {
       Long2IntOpenHashMap longMap = new Long2IntOpenHashMap(_cardinality);
       for (int i = 0; i < _cardinality; i++) {
         longMap.put(i, i);
       }
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += longMap.get(RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkLongBinarySearch() {
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += Arrays.binarySearch(_sortedLongs, RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkFloatMap() {
       Float2IntOpenHashMap floatMap = new Float2IntOpenHashMap(_cardinality);
       for (int i = 0; i < _cardinality; i++) {
         floatMap.put(i, i);
       }
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += floatMap.get(RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkFloatBinarySearch() {
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += Arrays.binarySearch(_sortedFloats, RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkDoubleMap() {
       Double2IntOpenHashMap doubleMap = new Double2IntOpenHashMap(_cardinality);
       for (int i = 0; i < _cardinality; i++) {
         doubleMap.put(i, i);
       }
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += doubleMap.get(RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkDoubleBinarySearch() {
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += Arrays.binarySearch(_sortedDoubles, RANDOM.nextInt(_cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkStringMap() {
       Object2IntOpenHashMap<String> stringMap = new Object2IntOpenHashMap<>(_cardinality);
       for (int i = 0; i < _cardinality; i++) {
         stringMap.put(Integer.toString(i + _cardinality), i);
       }
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += stringMap.getInt(Integer.toString(RANDOM.nextInt(_cardinality) + _cardinality));
       }
       return result;
     }
   
     @Benchmark
     @BenchmarkMode(Mode.AverageTime)
     @OutputTimeUnit(TimeUnit.MILLISECONDS)
     public int benchmarkStringBinarySearch() {
       int result = 0;
       for (int i = 0; i < NUM_LOOK_UPS; i++) {
         result += Arrays.binarySearch(_sortedStrings, Integer.toString(RANDOM.nextInt(_cardinality) + _cardinality));
       }
       return result;
     }
   
     public static void main(String[] args)
         throws Exception {
       Options opt =
           new OptionsBuilder().include(BenchmarkBinarySearch.class.getSimpleName()).warmupTime(TimeValue.seconds(5))
               .warmupIterations(2).measurementTime(TimeValue.seconds(10)).measurementIterations(3).forks(1).build();
   
       new Runner(opt).run();
     }
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1159285009

   > How do you know how much extra time it takes to build a column based on this benchmark? Shouldn’t it benchmark segment build time instead?
   
   This benchmark isolates the steps of map creation + lookup when building a segment with 5M documents. I tried building the segments in the offline integration test, and the time spent is reduced from 4.8s to 3.6s, but that is the combined result of all the changes in this PR. To better testing the performance impact of binary search, I decide to not include that change in this PR, and have a separate one just for the binary search change


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1158590337

   > Performed a benchmark with 5M lookups on different data types with different cardinality:
   > 
   > ```
   > Benchmark                                          (_cardinality)  Mode  Cnt     Score      Error  Units
   > BenchmarkBinarySearch.benchmarkDoubleBinarySearch             100  avgt    3   216.168 ±   17.112  ms/op
   > BenchmarkBinarySearch.benchmarkDoubleBinarySearch           10000  avgt    3   402.108 ±   37.544  ms/op
   > BenchmarkBinarySearch.benchmarkDoubleBinarySearch         1000000  avgt    3   831.683 ±  417.802  ms/op
   > BenchmarkBinarySearch.benchmarkDoubleMap                      100  avgt    3    65.926 ±   21.608  ms/op
   > BenchmarkBinarySearch.benchmarkDoubleMap                    10000  avgt    3    87.987 ±    7.291  ms/op
   > BenchmarkBinarySearch.benchmarkDoubleMap                  1000000  avgt    3   568.169 ±   87.005  ms/op
   > BenchmarkBinarySearch.benchmarkFloatBinarySearch              100  avgt    3   203.972 ±    8.662  ms/op
   > BenchmarkBinarySearch.benchmarkFloatBinarySearch            10000  avgt    3   393.657 ±   49.358  ms/op
   > BenchmarkBinarySearch.benchmarkFloatBinarySearch          1000000  avgt    3   751.277 ±   13.846  ms/op
   > BenchmarkBinarySearch.benchmarkFloatMap                       100  avgt    3    89.837 ±   24.091  ms/op
   > BenchmarkBinarySearch.benchmarkFloatMap                     10000  avgt    3    87.869 ±   17.819  ms/op
   > BenchmarkBinarySearch.benchmarkFloatMap                   1000000  avgt    3   404.022 ±   56.542  ms/op
   > BenchmarkBinarySearch.benchmarkIntBinarySearch                100  avgt    3   191.057 ±   24.173  ms/op
   > BenchmarkBinarySearch.benchmarkIntBinarySearch              10000  avgt    3   345.814 ±   78.147  ms/op
   > BenchmarkBinarySearch.benchmarkIntBinarySearch            1000000  avgt    3   693.074 ±   23.619  ms/op
   > BenchmarkBinarySearch.benchmarkIntMap                         100  avgt    3    81.485 ±    1.535  ms/op
   > BenchmarkBinarySearch.benchmarkIntMap                       10000  avgt    3    85.403 ±    7.045  ms/op
   > BenchmarkBinarySearch.benchmarkIntMap                     1000000  avgt    3   219.425 ±   59.368  ms/op
   > BenchmarkBinarySearch.benchmarkLongBinarySearch               100  avgt    3   182.610 ±   13.995  ms/op
   > BenchmarkBinarySearch.benchmarkLongBinarySearch             10000  avgt    3   365.881 ±   15.217  ms/op
   > BenchmarkBinarySearch.benchmarkLongBinarySearch           1000000  avgt    3   774.414 ±  402.259  ms/op
   > BenchmarkBinarySearch.benchmarkLongMap                        100  avgt    3    65.400 ±    5.087  ms/op
   > BenchmarkBinarySearch.benchmarkLongMap                      10000  avgt    3    98.427 ±   11.991  ms/op
   > BenchmarkBinarySearch.benchmarkLongMap                    1000000  avgt    3   302.917 ±    6.348  ms/op
   > BenchmarkBinarySearch.benchmarkStringBinarySearch             100  avgt    3   306.576 ±   39.766  ms/op
   > BenchmarkBinarySearch.benchmarkStringBinarySearch           10000  avgt    3   870.486 ±   52.621  ms/op
   > BenchmarkBinarySearch.benchmarkStringBinarySearch         1000000  avgt    3  3173.214 ± 1122.761  ms/op
   > BenchmarkBinarySearch.benchmarkStringMap                      100  avgt    3   115.433 ±    4.070  ms/op
   > BenchmarkBinarySearch.benchmarkStringMap                    10000  avgt    3   254.615 ±   15.829  ms/op
   > BenchmarkBinarySearch.benchmarkStringMap                  1000000  avgt    3  1372.186 ±  197.025  ms/op
   > ```
   > 
   > Note that all the strings have the same prefix, so it is actually adding some comparison cost for the binary search. With 1M cardinality, binary search is about 2-3x slower than hash lookup, but it is still fast enough (adding less than 2 seconds for string), and shouldn't become an issue. Considering the memory saving (especially preventing running out of memory when generating segments for wide tables), and the extra GC overhead of creating a map, this change should be beneficial.
   > 
   > Here is the benchmark code:
   > 
   > ```
   > package org.apache.pinot.perf;
   > 
   > import it.unimi.dsi.fastutil.doubles.Double2IntOpenHashMap;
   > import it.unimi.dsi.fastutil.floats.Float2IntOpenHashMap;
   > import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap;
   > import it.unimi.dsi.fastutil.longs.Long2IntOpenHashMap;
   > import it.unimi.dsi.fastutil.objects.Object2IntOpenHashMap;
   > import java.io.IOException;
   > import java.util.Arrays;
   > import java.util.Random;
   > import java.util.concurrent.TimeUnit;
   > import org.openjdk.jmh.annotations.Benchmark;
   > import org.openjdk.jmh.annotations.BenchmarkMode;
   > import org.openjdk.jmh.annotations.Mode;
   > import org.openjdk.jmh.annotations.OutputTimeUnit;
   > import org.openjdk.jmh.annotations.Param;
   > import org.openjdk.jmh.annotations.Scope;
   > import org.openjdk.jmh.annotations.Setup;
   > import org.openjdk.jmh.annotations.State;
   > import org.openjdk.jmh.runner.Runner;
   > import org.openjdk.jmh.runner.options.Options;
   > import org.openjdk.jmh.runner.options.OptionsBuilder;
   > import org.openjdk.jmh.runner.options.TimeValue;
   > 
   > 
   > @State(Scope.Benchmark)
   > public class BenchmarkBinarySearch {
   >   private static final int NUM_LOOK_UPS = 5000000;
   >   private static final Random RANDOM = new Random();
   > 
   >   @Param({"100", "10000", "1000000"})
   >   private int _cardinality;
   > 
   >   private int[] _sortedInts;
   >   private long[] _sortedLongs;
   >   private float[] _sortedFloats;
   >   private double[] _sortedDoubles;
   >   private String[] _sortedStrings;
   > 
   >   @Setup
   >   public void setUp()
   >       throws IOException {
   >     _sortedInts = new int[_cardinality];
   >     _sortedLongs = new long[_cardinality];
   >     _sortedFloats = new float[_cardinality];
   >     _sortedDoubles = new double[_cardinality];
   >     _sortedStrings = new String[_cardinality];
   >     for (int i = 0; i < _cardinality; i++) {
   >       _sortedInts[i] = i;
   >       _sortedLongs[i] = i;
   >       _sortedFloats[i] = i;
   >       _sortedDoubles[i] = i;
   >       String string = Integer.toString(i + _cardinality);
   >       _sortedStrings[i] = string;
   >     }
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkIntMap() {
   >     Int2IntOpenHashMap intMap = new Int2IntOpenHashMap(_cardinality);
   >     for (int i = 0; i < _cardinality; i++) {
   >       intMap.put(i, i);
   >     }
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += intMap.get(RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkIntBinarySearch() {
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += Arrays.binarySearch(_sortedInts, RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkLongMap() {
   >     Long2IntOpenHashMap longMap = new Long2IntOpenHashMap(_cardinality);
   >     for (int i = 0; i < _cardinality; i++) {
   >       longMap.put(i, i);
   >     }
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += longMap.get(RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkLongBinarySearch() {
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += Arrays.binarySearch(_sortedLongs, RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkFloatMap() {
   >     Float2IntOpenHashMap floatMap = new Float2IntOpenHashMap(_cardinality);
   >     for (int i = 0; i < _cardinality; i++) {
   >       floatMap.put(i, i);
   >     }
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += floatMap.get(RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkFloatBinarySearch() {
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += Arrays.binarySearch(_sortedFloats, RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkDoubleMap() {
   >     Double2IntOpenHashMap doubleMap = new Double2IntOpenHashMap(_cardinality);
   >     for (int i = 0; i < _cardinality; i++) {
   >       doubleMap.put(i, i);
   >     }
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += doubleMap.get(RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkDoubleBinarySearch() {
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += Arrays.binarySearch(_sortedDoubles, RANDOM.nextInt(_cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkStringMap() {
   >     Object2IntOpenHashMap<String> stringMap = new Object2IntOpenHashMap<>(_cardinality);
   >     for (int i = 0; i < _cardinality; i++) {
   >       stringMap.put(Integer.toString(i + _cardinality), i);
   >     }
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += stringMap.getInt(Integer.toString(RANDOM.nextInt(_cardinality) + _cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   @Benchmark
   >   @BenchmarkMode(Mode.AverageTime)
   >   @OutputTimeUnit(TimeUnit.MILLISECONDS)
   >   public int benchmarkStringBinarySearch() {
   >     int result = 0;
   >     for (int i = 0; i < NUM_LOOK_UPS; i++) {
   >       result += Arrays.binarySearch(_sortedStrings, Integer.toString(RANDOM.nextInt(_cardinality) + _cardinality));
   >     }
   >     return result;
   >   }
   > 
   >   public static void main(String[] args)
   >       throws Exception {
   >     Options opt =
   >         new OptionsBuilder().include(BenchmarkBinarySearch.class.getSimpleName()).warmupTime(TimeValue.seconds(5))
   >             .warmupIterations(2).measurementTime(TimeValue.seconds(10)).measurementIterations(3).forks(1).build();
   > 
   >     new Runner(opt).run();
   >   }
   > }
   > ```
   
   How do you know how much extra time it takes to build a column based on this benchmark? Shouldn’t it benchmark segment build time instead?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r901025316


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -100,8 +97,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toLongArray();
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {

Review Comment:
   I don't want to introduce an extra per-value check (other checks are per column). IMO it is okay for it to throw NPE if someone uses it incorrectly. Added an assert



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r899754796


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BigDecimalColumnPreIndexStatsCollector.java:
##########
@@ -113,8 +111,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toArray(new BigDecimal[0]);
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {
+      _sortedValues = _values.toArray(new BigDecimal[0]);
+      _values = null;

Review Comment:
   That shouldn't be needed. Once the `_values` is set to `null`, the whole map can be garbage collected. `_values.clear()` will introduce extra overhead, but not reduce the garbage



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] xiangfu0 commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
xiangfu0 commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r901020485


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -100,8 +97,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toLongArray();
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {

Review Comment:
   If there is no sorting after seal, we should also throw exception in collect(Object) if _sealed=true to avoid any unexpected behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] richardstartin commented on pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
richardstartin commented on PR #8846:
URL: https://github.com/apache/pinot/pull/8846#issuecomment-1148400449

   You don't mention replacing constant time lookups with logarithmic time lookups in `SegmentDictionaryCreator` making this a space/time tradeoff. It would be good to have some numbers to demonstrate the tradeoff is worthwhile, and to mention it in the PR description to help someone track down the root cause of a regression if it happens.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] jackjlli commented on a diff in pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
jackjlli commented on code in PR #8846:
URL: https://github.com/apache/pinot/pull/8846#discussion_r891664722


##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/BigDecimalColumnPreIndexStatsCollector.java:
##########
@@ -113,8 +111,11 @@ public int getCardinality() {
 
   @Override
   public void seal() {
-    _sortedValues = _values.toArray(new BigDecimal[0]);
-    Arrays.sort(_sortedValues);
-    _sealed = true;
+    if (!_sealed) {
+      _sortedValues = _values.toArray(new BigDecimal[0]);
+      _values = null;

Review Comment:
   Should we consider invoking `_values.clear()` method before setting it `null`? Same for all the other stats collectors.



##########
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/stats/LongColumnPreIndexStatsCollector.java:
##########
@@ -57,10 +56,10 @@ public void collect(Object entry) {
     }
   }
 
-  void addressSorted(long entry) {
-    if (_isSorted) {
+  private void addressSorted(long entry) {

Review Comment:
   I saw some of the `addressSorted` method use `private` but some don't. It'd be good to unify them.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org


[GitHub] [pinot] Jackie-Jiang merged pull request #8846: Reduce the heap memory usage for segment creation

Posted by GitBox <gi...@apache.org>.
Jackie-Jiang merged PR #8846:
URL: https://github.com/apache/pinot/pull/8846


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@pinot.apache.org
For additional commands, e-mail: commits-help@pinot.apache.org