You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by "yashmayya (via GitHub)" <gi...@apache.org> on 2023/05/29 14:38:37 UTC

[GitHub] [kafka] yashmayya opened a new pull request, #13776: KAFKA-15034: Use HashSet for include / exclude fields in ReplaceField SMT; add JMH benchmark

yashmayya opened a new pull request, #13776:
URL: https://github.com/apache/kafka/pull/13776

   - https://issues.apache.org/jira/browse/KAFKA-15034
   - The `ReplaceField` SMT can be configured with a list of fields that are to be included or excluded during every record transformation.
   - Currently, it uses an `ArrayList` for these fields which causes the filter operations to be of O(N) complexity resulting in poor performance when configured with a large number of include / exclude fields.
   - This patch refactors it to use a `HashSet` instead and adds a JMH benchmark to demonstrate the improvements.
   
   
   ## JMH Benchmark Result Before:
   
   ```
   Benchmark                                                  (fieldCount)  Mode  Cnt          Score          Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5      15989.752 ±      492.902  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5    1162684.117 ±    39909.426  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  109347304.542 ± 10637587.550  ns/op
   ```
   
   ## JMH Benchmark Result After:
   
   ```
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5    4777.655 ±   375.941  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   65102.026 ±  2451.422  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  731448.223 ± 68711.412  ns/op
   ```
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1212655937


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   Makes sense, I've added a separate parameter controlling the number of include and exclude fields since it was a pretty minor change. These are the new results:
   
   ## JMH Benchmark Result Before (`ArrayList` based implementation):
   
   ```
   Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt          Score         Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5        928.115 ±      20.251  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5      10380.401 ±     286.643  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5      97058.104 ±    3834.409  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5      15052.629 ±     112.337  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5     301212.390 ±    6400.402  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5    2218226.090 ±   38198.098  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5     582789.404 ±   11436.565  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5    6263588.619 ±  530370.435  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  133424024.627 ± 9497024.791  ns/op
   ```
   
   ## JMH Benchmark Result After (`HashSet` based implementation):
   
   ```
   Benchmark                                                  (includeExcludeFieldCount)  (valueFieldCount)  Mode  Cnt       Score      Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1                100  avgt    5    1205.928 ±   32.611  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1               1000  avgt    5   10124.067 ±  212.876  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                           1              10000  avgt    5  105143.540 ± 1813.534  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100                100  avgt    5    1602.392 ±   15.756  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100               1000  avgt    5   11543.659 ±  193.129  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                         100              10000  avgt    5  171689.002 ± 5691.014  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000                100  avgt    5    1686.155 ±   21.922  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000               1000  avgt    5   20584.457 ±  429.614  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark                       10000              10000  avgt    5  221015.401 ± 8108.798  ns/op
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] C0urante merged pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "C0urante (via GitHub)" <gi...@apache.org>.

C0urante merged PR #13776:
URL: https://github.com/apache/kafka/pull/13776


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] vamossagar12 commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "vamossagar12 (via GitHub)" <gi...@apache.org>.

vamossagar12 commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1211158450


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   I feel they are going to perform the same. The time complexity of String#hashCode is O(n) and same is the case for equals method. No harm in trying though :) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210620016


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   Hm that's an interesting point and I hadn't considered that. However, taking a closer look, I don't think there'll be any performance penalty here because the `ArrayList::contains` method will involve calls to `String::equals` to check whether each value field is present in the exclude / include fields. `String::equals` and `String::hashCode` both have linear time complexity in terms of the length of the String so there won't be much difference in the performance between the `HashSet` version and the `ArrayList` version when there's a small number of include / exclude fields. I ran a modified benchmark (using a single include field and a single exclude field) and the results were fairly similar between the two implementations:
   
   ## Using Strings with length ~10
   
   ```
   With HashSet:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score      Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5    1034.411 ±  138.769  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   12210.318 ±  119.902  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  125383.027 ± 3884.894  ns/op
   
   -------------------------------------------------------------------------------------------------------------------
   
   With ArrayList:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt      Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5    919.979 ±    15.950  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   8757.357 ±   114.363  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  96112.550 ± 10639.302  ns/op
   ```
   
   ## Using Strings with length ~1000
   
   ```
   With HashSet:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5     982.217 ±    16.189  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   12327.410 ±   657.447  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  162847.616 ± 25125.797  ns/op
   
   -------------------------------------------------------------------------------------------------------------------
   
   With ArrayList:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5   16189.095 ±   741.592  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   24110.247 ±   180.101  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  165554.095 ± 17875.657  ns/op
   ```
   
   Even on attempting to make use of String interning to speed up the equals checks, the results were very similar.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210628638


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   ~~Only the first call to `String::hashCode` for a string object actually does the hash code computation which is cached and subsequent calls just fetch the cached value.~~
   
   Edit: The concern was with computing the hash code for value fields so the caching isn't relevant here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] C0urante commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "C0urante (via GitHub)" <gi...@apache.org>.

C0urante commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210481491


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;

Review Comment:
   (To match the suggestion below RE an explicit setup level)
   ```suggestion
   import org.openjdk.jmh.annotations.Fork;
   import org.openjdk.jmh.annotations.Level;
   ```



##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup

Review Comment:
   In most other benchmarks we're explicit about the setup level, even if it matches the default:
   ```suggestion
       @Setup(Level.Trial)
   ```



##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   Is it possible that there's a performance penalty for this approach when the transform has been configured with a small number of include/exclude fields (e.g., 1), since we'd have to compute the hash code of each field in the value?



##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   We may want to add a separate parameter for the number of included/excluded fields (can be a single parameter to control both, or a separate parameter for each) in order to cover the case of a value with a large number of fields and a small number of included/excluded fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210647221


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   The string interning doesn't make much of a difference since there's only a single field value it benefits (also it'd benefit the `HashSet` implementation as well because of the aforementioned hash code caching).
   
   Admittedly, the benchmark does use strings of identical length with long matching prefixes for the value fields as well as the include / exclude fields and since the `String::equals` method should likely be O(1) for strings of unequal length, we'd probably see better results for the `ArrayList` based implementation if the benchmark used randomized strings of random length. However, I'd argue that we're probably overthinking it at this point, WDYT? 😅 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210628638


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   Only the first call to `String::hashCode` for a string object actually does the hash code computation which is cached and subsequent calls just fetch the cached value.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] C0urante commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "C0urante (via GitHub)" <gi...@apache.org>.

C0urante commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1212118898


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   Object lookup in a hash set usually involves both computing its hash and performing an equality check, since multiple objects may occupy the same bucket. Lookup in a single-element list may theoretically be faster if it only involves a single equality check.
   
   I think it's worth including in the benchmark for a few reasons:
   - Saves people the trouble of having to look up this PR discussion
   - Covers a more-common case (it's much more likely that someone configures this SMT with 1 field than 10,000)
   - Guards against performance regressions if we change things in the future
   
   But if it's too much work then we can merge as-is. @yashmayya let me know what your decision is.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210647221


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   The string interning doesn't make much of a difference since there's only a single field value it benefits (also it'd benefit the `HashSet` implementation as well because of the aforementioned hash code caching).
   
   Admittedly, the benchmark does use strings of identical length for the value fields as well as the include / exclude fields and since the `String::equals` method should likely be O(1) for strings of unequal length, we'd probably see better results for the `ArrayList` based implementation if the benchmark used strings of random length. However, I'd argue that we're probably overthinking it at this point, WDYT? 😅 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] C0urante commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "C0urante (via GitHub)" <gi...@apache.org>.

C0urante commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1212108143


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   Can't argue with the numbers. Thanks for looking into this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] C0urante commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "C0urante (via GitHub)" <gi...@apache.org>.

C0urante commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1213569347


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   Awesome, thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210622194


##########
jmh-benchmarks/src/main/java/org/apache/kafka/jmh/connect/ReplaceFieldBenchmark.java:
##########
@@ -0,0 +1,78 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.kafka.jmh.connect;
+
+import org.apache.kafka.connect.source.SourceRecord;
+import org.apache.kafka.connect.transforms.ReplaceField;
+import org.openjdk.jmh.annotations.Benchmark;
+import org.openjdk.jmh.annotations.BenchmarkMode;
+import org.openjdk.jmh.annotations.Fork;
+import org.openjdk.jmh.annotations.Measurement;
+import org.openjdk.jmh.annotations.Mode;
+import org.openjdk.jmh.annotations.OutputTimeUnit;
+import org.openjdk.jmh.annotations.Param;
+import org.openjdk.jmh.annotations.Scope;
+import org.openjdk.jmh.annotations.Setup;
+import org.openjdk.jmh.annotations.State;
+import org.openjdk.jmh.annotations.Warmup;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+
+/**
+ * This benchmark tests the performance of the {@link ReplaceField} {@link org.apache.kafka.connect.transforms.Transformation SMT}
+ * when configured with a large number of include and exclude fields and applied on a {@link SourceRecord} containing a similarly
+ * large number of fields.
+ */
+@State(Scope.Benchmark)
+@Fork(value = 1)
+@Warmup(iterations = 3)
+@Measurement(iterations = 5)
+@BenchmarkMode(Mode.AverageTime)
+@OutputTimeUnit(TimeUnit.NANOSECONDS)
+public class ReplaceFieldBenchmark {
+
+    @Param({"100", "1000", "10000"})
+    private int fieldCount;
+    private ReplaceField<SourceRecord> replaceFieldSmt;
+    private SourceRecord record;
+
+    @Setup
+    public void setup() {
+        this.replaceFieldSmt = new ReplaceField.Value<>();
+        Map<String, String> replaceFieldConfigs = new HashMap<>();
+        replaceFieldConfigs.put("exclude",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 0).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));
+        replaceFieldConfigs.put("include",
+                IntStream.range(0, fieldCount).filter(x -> (x & 1) == 1).mapToObj(x -> "Field-" + x).collect(Collectors.joining(",")));

Review Comment:
   Given the above observations, do you feel like this is still required?



##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   Hm that's an interesting point and I hadn't considered that. However, taking a closer look, I don't think there'll be any performance penalty here because the `ArrayList::contains` method will involve calls to `String::equals` to check whether each value field is present in the exclude / include fields. `String::equals` and `String::hashCode` both have linear time complexity in terms of the length of the String so there won't be much difference in the performance between the `HashSet` version and the `ArrayList` version when there's a small number of include / exclude fields. I ran a modified benchmark and the results were fairly similar between the two implementations:
   
   ## Using Strings with length ~10
   
   ```
   With HashSet:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score      Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5    1034.411 ±  138.769  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   12210.318 ±  119.902  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  125383.027 ± 3884.894  ns/op
   
   -------------------------------------------------------------------------------------------------------------------
   
   With ArrayList:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt      Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5    919.979 ±    15.950  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   8757.357 ±   114.363  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  96112.550 ± 10639.302  ns/op
   ```
   
   ## Using Strings with length ~1000
   
   ```
   With HashSet:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5     982.217 ±    16.189  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   12327.410 ±   657.447  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  162847.616 ± 25125.797  ns/op
   
   -------------------------------------------------------------------------------------------------------------------
   
   With ArrayList:
   
   Benchmark                                                  (fieldCount)  Mode  Cnt       Score       Error  Units
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark           100  avgt    5   16189.095 ±   741.592  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark          1000  avgt    5   24110.247 ±   180.101  ns/op
   ReplaceFieldBenchmark.includeExcludeReplaceFieldBenchmark         10000  avgt    5  165554.095 ± 17875.657  ns/op
   ```
   
   Even on attempting to make use of String interning to speed up the equals checks, the results were very similar.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] yashmayya commented on a diff in pull request #13776: KAFKA-15034: Improve performance of the ReplaceField SMT; add JMH benchmark

Posted by "yashmayya (via GitHub)" <gi...@apache.org>.

yashmayya commented on code in PR #13776:
URL: https://github.com/apache/kafka/pull/13776#discussion_r1210647221


##########
connect/transforms/src/main/java/org/apache/kafka/connect/transforms/ReplaceField.java:
##########
@@ -94,8 +96,8 @@ public void configure(Map<String, ?> configs) {
             {ConfigName.EXCLUDE, "blacklist"},
         }));
 
-        exclude = config.getList(ConfigName.EXCLUDE);
-        include = config.getList(ConfigName.INCLUDE);
+        exclude = new HashSet<>(config.getList(ConfigName.EXCLUDE));
+        include = new HashSet<>(config.getList(ConfigName.INCLUDE));

Review Comment:
   The string interning doesn't make much of a difference since there's only a single field value it benefits (also it'd benefit the `HashSet` implementation as well because of the aforementioned hash code caching).
   
   Admittedly, the benchmark does use strings of identical length with long matching prefixes for the value fields as well as the include / exclude fields and since the `String::equals` method should likely be O(1) for strings of unequal length, we'd probably see better results for the `ArrayList` based implementation if the benchmark used randomized strings of random length. However, I'd argue that we're probably overthinking it at this point, WDYT? 😅 
   
   Edit: Interestingly, even using long strings of different lengths and / or prefixes shows much better performance with the `HashSet` implementation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org