You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/04/06 16:52:00 UTC

[jira] [Work logged] (BEAM-14104) Shard aware Kinesis record aggregation (AWS Sdk v2)

     [ https://issues.apache.org/jira/browse/BEAM-14104?focusedWorklogId=753518&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-753518 ]

ASF GitHub Bot logged work on BEAM-14104:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Apr/22 16:51
            Start Date: 06/Apr/22 16:51
    Worklog Time Spent: 10m 
      Work Description: aromanenko-dev commented on code in PR #17113:
URL: https://github.com/apache/beam/pull/17113#discussion_r844150744


##########
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/kinesis/KinesisIO.java:
##########
@@ -947,25 +1003,37 @@ private void validateExplicitHashKey(String hashKey) {
      * with KCL to correctly implement the binary protocol, specifically {@link
      * software.amazon.kinesis.retrieval.kpl.Messages.AggregatedRecord}.
      *
-     * <p>Note: The aggregation is a lot simpler than the one offered by KPL. While the KPL is aware
-     * of effective hash key ranges assigned to each shard, we're not and don't want to be to keep
-     * complexity manageable and avoid the risk of silently loosing records in the KCL:
+     * <p>To aggregate records the best possible way, records are assigned an explicit hash key that
+     * corresponds to the lower bound of the hash key range of the target shard. In case a record
+     * has already an explicit hash key assigned, it is kept unchanged.
      *
-     * <p>{@link software.amazon.kinesis.retrieval.AggregatorUtil#deaggregate(List, BigInteger,
-     * BigInteger)} drops records not matching the expected hash key range.
+     * <p>Hash key ranges of shards are expected to be only slowly changing and get refreshed
+     * infrequently. If using an {@link ExplicitPartitioner} or disabling shard refresh via {@link
+     * RecordAggregation}, no shard details will be pulled.
      */
     static class AggregatedWriter<T> extends Writer<T> {
       private static final Logger LOG = LoggerFactory.getLogger(AggregatedWriter.class);
+      private static final ObjectPool<String, ShardRanges> SHARDRANGES_BY_STREAM =

Review Comment:
   nit: `SHARD_RANGES_BY_STREAM `



##########
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/common/ObjectPool.java:
##########
@@ -0,0 +1,145 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.aws2.common;
+
+import static org.apache.beam.sdk.io.aws2.common.ClientBuilderFactory.buildClient;
+
+import java.util.function.Function;
+import org.apache.beam.sdk.function.ThrowingConsumer;
+import org.apache.beam.sdk.io.aws2.options.AwsOptions;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.BiMap;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.HashBiMap;
+import org.apache.commons.lang3.tuple.Pair;
+import org.checkerframework.checker.nullness.qual.NonNull;
+import org.checkerframework.checker.nullness.qual.Nullable;
+import org.slf4j.LoggerFactory;
+import software.amazon.awssdk.awscore.client.builder.AwsClientBuilder;
+import software.amazon.awssdk.core.SdkClient;
+
+/**
+ * Reference counting object pool to easily share & destroy objects.
+ *
+ * <p>NOTE: This relies heavily on the implementation of {@link #equals(Object)} for {@link KeyT}.
+ * If not implemented properly, clients can't be shared between instances of {@link
+ * org.apache.beam.sdk.transforms.DoFn}.
+ *
+ * @param <KeyT>> Key to share objects by
+ * @param <ObjectT>> Shared object
+ */
+public class ObjectPool<KeyT extends @NonNull Object, ObjectT extends @NonNull Object> {

Review Comment:
   Should this class and its methods be `public` or just package private?? 



##########
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/kinesis/KinesisPartitioner.java:
##########
@@ -47,6 +47,26 @@
     return null;
   }
 
+  /**
+   * An explicit partitioner that always returns a {@code Nonnull} explicit hash key. The partition
+   * key is irrelevant in this case, though it cannot be {@code null}.
+   */
+  interface ExplicitPartitioner<T> extends KinesisPartitioner<T> {
+    @Override
+    default @Nonnull String getPartitionKey(T record) {
+      return "a"; // will be ignored, but can't be null

Review Comment:
   Return just empty string?



##########
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/common/ClientPool.java:
##########
@@ -1,123 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.beam.sdk.io.aws2.common;
-
-import java.util.function.BiFunction;
-import org.apache.beam.sdk.io.aws2.options.AwsOptions;
-import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.BiMap;
-import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.HashBiMap;
-import org.apache.commons.lang3.tuple.Pair;
-import org.checkerframework.checker.nullness.qual.Nullable;
-import software.amazon.awssdk.awscore.client.builder.AwsClientBuilder;
-
-/**
- * Reference counting pool to easily share AWS clients or similar by individual client provider and
- * configuration (optional).
- *
- * <p>NOTE: This relies heavily on the implementation of {@link #equals(Object)} for {@link
- * ProviderT} and {@link ConfigT}. If not implemented properly, clients can't be shared between
- * instances of {@link org.apache.beam.sdk.transforms.DoFn}.
- *
- * @param <ProviderT> Client provider
- * @param <ConfigT> Optional, nullable configuration
- * @param <ClientT> Client
- */
-public class ClientPool<ProviderT, ConfigT, ClientT extends AutoCloseable> {

Review Comment:
   And it was `public` just by chance or explicitly? Can it break something for users?



##########
sdks/java/io/amazon-web-services2/src/main/java/org/apache/beam/sdk/io/aws2/common/RetryConfiguration.java:
##########
@@ -68,7 +68,7 @@
   public abstract RetryConfiguration.Builder toBuilder();
 
   public static Builder builder() {
-    return Builder.builder();
+    return Builder.builder().numRetries(3);

Review Comment:
   Why number of retries "3" was choosen?





Issue Time Tracking
-------------------

    Worklog Id:     (was: 753518)
    Time Spent: 2h 10m  (was: 2h)

> Shard aware Kinesis record aggregation (AWS Sdk v2)
> ---------------------------------------------------
>
>                 Key: BEAM-14104
>                 URL: https://issues.apache.org/jira/browse/BEAM-14104
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-aws
>            Reporter: Moritz Mack
>            Assignee: Moritz Mack
>            Priority: P2
>              Labels: aws-sdk-v2, kinesis
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Implement advanced Kinesis record aggregation that is aware of active shards in the stream for optimal record aggregation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)