Posted to commits@hudi.apache.org by "Sagar Sumit (Jira)" <ji...@apache.org> on 2022/09/14 06:21:00 UTC

[jira] [Closed] (HUDI-2949) Benchmark Clustering performance

     [ https://issues.apache.org/jira/browse/HUDI-2949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sagar Sumit closed HUDI-2949.
-----------------------------
    Resolution: Done

> Benchmark Clustering performance
> --------------------------------
>
>                 Key: HUDI-2949
>                 URL: https://issues.apache.org/jira/browse/HUDI-2949
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Major
>
> These observations are from running Layout Optimization (Clustering) on a [small Amazon Reviews|https://s3.amazonaws.com/amazon-reviews-pds/readme.html] (4.5Gb, reduced) dataset
> h2. *Major*
>  * GC is taking up to *25%* of CPU cycles (a lot of churn)
>  ** A lot of ArrayList resizing, as in the snippet below (an avoidance sketch follows this list)
>  
> {code:java}
> // Creating an empty list, then immediately inserting into it (forces repeated ArrayList resizing)
> List<Object> values = new ArrayList<>();
> values.addAll(JavaConverters.bufferAsJavaListConverter(row.toSeq().toBuffer()).asJava());
> values.add(hilbertValue);
> {code}
>  
>  ** A lot of avoidable boxing, like the following (also covered by the sketch after this list)
> {code:java}
> // Collecting as boxed Longs, then unboxing back into longs
> List<Long> longList = fieldMap.entrySet().stream().map(...);
> byte[] hilbertValue = HilbertCurveUtils.indexBytes(
>     hilbertCurve, longList.stream().mapToLong(l -> l).toArray(), 63);
> {code}
>  * Up to *20%* of wall-clock time is spent under locks in BoundedInMemoryExecutor (LinkedBlockingQueue)
>  * ~35% of wall-clock time is spent Gzip-ing the output
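> For reference, a minimal sketch of how both allocations above could be avoided: pre-sizing the list so {{ArrayList}} never resizes, and mapping straight into a primitive {{long[]}} instead of collecting boxed {{Long}}s. The method names and the {{Map<String, Long>}} shape of {{fieldMap}} are assumptions for illustration; the resulting array would feed {{HilbertCurveUtils.indexBytes}} as in the snippet above.
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> 
> import org.apache.spark.sql.Row;
> 
> public class AllocationSketch {
> 
>   // Pre-size the ArrayList so it never has to grow and copy its backing array.
>   static List<Object> rowToValues(Row row, byte[] hilbertValue) {
>     List<Object> values = new ArrayList<>(row.size() + 1);
>     for (int i = 0; i < row.size(); i++) {
>       values.add(row.get(i));
>     }
>     values.add(hilbertValue);
>     return values;
>   }
> 
>   // Map straight into a primitive long[]; no intermediate List<Long> is built.
>   static long[] toPrimitiveLongs(Map<String, Long> fieldMap) {
>     return fieldMap.values().stream()
>         .mapToLong(Long::longValue) // one unbox per element, zero boxing
>         .toArray();
>   }
> }
> {code}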
> h2. *Per stage*
>  
> The experiment can roughly be broken down into the following stages (a config sketch follows this list):
>  * {_}Bulk-insert{_}: bulk insert of the raw data into the Hudi table
>  * {_}Sorting{_}: re-sorting the data according to the Layout Optimization config (reshuffling)
>  * {_}Bulk-insert (of the sorted data){_}: bulk inserting the reshuffled data
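> For context, a rough sketch of the kind of Spark write that drives these stages. This is an illustration only: the input path, record-key field and sort columns are placeholders, and the clustering/layout-optimization option names are assumptions from 0.10.x-era configs that should be checked against the release under test.
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> import org.apache.spark.sql.SparkSession;
> 
> public class ClusteringBenchmarkSketch {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder().appName("clustering-benchmark").getOrCreate();
>     Dataset<Row> df = spark.read().parquet(args[0]); // raw Amazon Reviews input (placeholder path)
> 
>     // Bulk-insert stage, with inline clustering re-sorting the data afterwards.
>     // The clustering/layout keys below are assumptions; verify per Hudi release.
>     df.write().format("hudi")
>         .option("hoodie.table.name", "amazon_reviews")
>         .option("hoodie.datasource.write.recordkey.field", "review_id")                     // assumption
>         .option("hoodie.datasource.write.operation", "bulk_insert")
>         .option("hoodie.clustering.inline", "true")
>         .option("hoodie.clustering.plan.strategy.sort.columns", "product_id,customer_id")   // assumption
>         .option("hoodie.layout.optimize.strategy", "hilbert")                               // assumption
>         .mode(SaveMode.Append)
>         .save(args[1]); // Hudi table base path (placeholder)
>   }
> }
> {code}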
>  
> h4. Bulk Insert
> {_}Memory Allocated Total{_}: 22,000 samples x 500kb (sampling frequency) ~= 11Gb
> {_}GC{_}: 6%
>  
> _Observations_
>  * *~30%* of CPU is spent on Gzip compression
>  ** Created HUDI-2928 to flip the default from gzip to zstd (an override sketch follows this list)
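> Until the default changes, the codec can be overridden per write via {{hoodie.parquet.compression.codec}}. A minimal sketch, assuming a Spark {{Dataset<Row>}} named {{df}} and a Parquet build that ships the zstd codec (snappy is the usual low-CPU fallback):
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> 
> public class CompressionOverrideSketch {
>   // Bulk-insert with the Parquet codec overridden away from the gzip default.
>   static void bulkInsertWithZstd(Dataset<Row> df, String basePath) {
>     df.write().format("hudi")
>         .option("hoodie.table.name", "amazon_reviews")
>         .option("hoodie.datasource.write.operation", "bulk_insert")
>         .option("hoodie.parquet.compression.codec", "zstd") // use "snappy" if zstd is unavailable
>         .mode(SaveMode.Append)
>         .save(basePath);
>   }
> }
> {code}
> 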
> h4. Sorting
> _Memory Allocated Total:_ 36,000 samples x 500kb (sampling frequency) ~= *18Gb*
> {_}GC{_}: ~6%
>  
> _Observations (Memory)_
>  * About *16%* is allocated by {{BinaryStatistics.updateStats}} in Parquet's {{ColumnWriterBase}}
>  ** Writing to a Parquet column as a whole allocates *~19%*, i.e. the actual write allocates only 3% and *80% of it is overhead*
>  ** Allocating {{HeapByteBuffer}} in {{Binary.toByteBuffer}} w/in {{PrimitiveComparator}} (!!!) while accounting for min/max values of columns
>  ** Created PARQUET-2106 / [PR#940|https://github.com/apache/parquet-mr/pull/940]
>  * About *18%* is spent on {{bytesToAvro}} / {{avroToBytes}} conversion in calls to
>  ** {{OverwriteWithLatestAvroPayload.getInsertValue}}
>  ** {{OverwriteWithLatestAvroPayload.<init>}}
>  * About 4% is allocated by fetching {{Path.getName}} in {{HoodieWrapperFileSystem.getBytesWritten}}
>  ** Internally Hadoop calls {{path.substring}}, allocating a new string every time
>  * About *5%* of memory is allocated by {{DefaultSizeEstimator.sizeEstimate}} (see the sketch after this list)
>  ** ~3% is in the ctor – every instance allocates {{private final Deque<Object> pending = new ArrayDeque<>(16 * 1024);}} by default
>  ** The remaining ~2% is allocated while traversing the object tree
>  *** Resizing hash-sets
>  *** Fetching methods/fields through reflection (allocates arrays)
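> To make the ctor cost concrete: {{new ArrayDeque<>(16 * 1024)}} eagerly allocates the backing {{Object[]}} (roughly 64–128 KB per estimator instance, depending on reference size) even if the traversal never gets that deep. A minimal sketch of one possible mitigation – starting at the default capacity and letting the deque grow on demand – purely as an illustration, not the actual Hudi change:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> public class SizeEstimatorAllocationSketch {
> 
>   // Eager: the backing array for 16 * 1024 references is allocated up front for every instance.
>   private final Deque<Object> eagerPending = new ArrayDeque<>(16 * 1024);
> 
>   // Lazy/small: start with the default capacity (16) and let ArrayDeque double
>   // its array only when the object graph actually gets that deep.
>   private final Deque<Object> lazyPending = new ArrayDeque<>();
> 
>   public static void main(String[] args) {
>     SizeEstimatorAllocationSketch sketch = new SizeEstimatorAllocationSketch();
>     // A shallow traversal touches only a handful of slots in either deque,
>     // but the eager variant has already paid for ~16K references of capacity.
>     sketch.lazyPending.push(new Object());
>     sketch.eagerPending.push(new Object());
>     System.out.println(sketch.lazyPending.size() + " / " + sketch.eagerPending.size());
>   }
> }
> {code}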
>  
> _Observations (CPU)_
>  * About 30% of the time is spent in a waiting state under locks w/in {{LinkedBlockingQueue}} in {{BoundedInMemoryQueue}} (a hand-off sketch follows this list)
>  * About 10% is spent on parsing Spark's {{Row}} in {{HoodieSparkUtils.createRdd}}
>  * About 2% of the CPU wall time is spent on parsing Avro schemas
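> To illustrate where that waiting comes from: {{BoundedInMemoryQueue}} hands records from a producer to a consumer through a bounded blocking queue, and both {{put}} and {{take}} acquire the queue's internal locks and park the thread once the bound is hit. A stripped-down sketch of that hand-off pattern (not Hudi's actual implementation; queue size and record shape are placeholders):
> {code:java}
> import java.util.concurrent.LinkedBlockingQueue;
> 
> public class BoundedHandOffSketch {
>   public static void main(String[] args) throws InterruptedException {
>     // Small bound so the producer regularly blocks in put() and the consumer in take();
>     // the profiled time shows up as waiting under the queue's internal locks.
>     LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(1024);
> 
>     Thread producer = new Thread(() -> {
>       try {
>         for (int i = 0; i < 100_000; i++) {
>           queue.put("record-" + i); // blocks (parks under the put lock) when the queue is full
>         }
>         queue.put("EOF");
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>     });
> 
>     Thread consumer = new Thread(() -> {
>       try {
>         String record;
>         while (!"EOF".equals(record = queue.take())) { // blocks (parks under the take lock) when empty
>           // transform/write the record here
>         }
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>     });
> 
>     producer.start();
>     consumer.start();
>     producer.join();
>     consumer.join();
>   }
> }
> {code}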
>  
> h4. Bulk-insert (sorted)
> Memory Allocated (Total): 45,000 samples x 500kb ~= *22Gb*
> GC: *~23%*
>  
> Observations are similar to [unordered bulk-insert|https://app.clickup.com/18029943/v/dc/h67bq-1900/h67bq-5880?block=block-3cfa6bf5-23bd-4e21-8a56-48fcb198b244] 
> h2. Profiling
> All profiles for these benchmarks have been taken using [async-profiler|https://github.com/jvm-profiling-tools/async-profiler].
>  
> {code:bash}
> # CPU 
> PID=48449;EVENT=itimer;TS=$(date +%s); ./profiler.sh collect -e $EVENT -d 60 -f "profile_${PID}${EVENT}${TS}.html" $PID
> # Memory
> PID=<pid>;EVENT=alloc;TS=$(date +%s); ./profiler.sh collect -e $EVENT -d 60 -f "profile_${PID}${EVENT}${TS}.html" --alloc 500k $PID
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)