Posted to commits@inlong.apache.org by "Yizhou-Yang (via GitHub)" <gi...@apache.org> on 2023/03/27 10:04:07 UTC

[GitHub] [inlong] Yizhou-Yang opened a new pull request, #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch

Yizhou-Yang opened a new pull request, #7659:
URL: https://github.com/apache/inlong/pull/7659

   ### Prepare a Pull Request
   - Fixes #7581 
   
   ### Motivation
   Support multiple-sink migration for the Elasticsearch sink.
   
   ### Modifications
   Mainly modified multipleElasticRowFunction to add 2 key features:
   1. Support for 4 additional options on ElasticsearchLoadNode
   2. Parsing of canal-json data
   
   TODO:
   1. table-level metric
   2. dirty data
   3. runtime strategy + other features
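
As a rough illustration of the canal-json parsing mentioned under Modifications: canal-json records carry a `type` field (`INSERT`, `UPDATE`, `DELETE`) next to `database`, `table` and `data`. The sketch below maps that field to a sink action. The class name, the `Action` enum and the use of a plain `Map` as the already-parsed record are assumptions made for this example; they are not taken from the PR.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: decides what to do with one already-parsed canal-json record. */
final class CanalJsonActionSketch {

    /** Actions a sink could take; the names are illustrative only. */
    public enum Action { UPSERT, DELETE, SKIP }

    /** Maps the canal-json "type" field to an action; unknown or missing types are skipped. */
    static Action actionFor(Map<String, Object> record) {
        Object type = record.get("type");
        if (type == null) {
            return Action.SKIP;
        }
        switch (type.toString().toUpperCase(Locale.ROOT)) {
            case "INSERT":
            case "UPDATE":
                // Both become an index/upsert request keyed by the document id.
                return Action.UPSERT;
            case "DELETE":
                return Action.DELETE;
            default:
                return Action.SKIP;
        }
    }
}
```

Treating `INSERT` and `UPDATE` alike is one possible choice for an upsert-style sink; the PR itself may handle them differently.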
   
   ### Verifying this change
   <img width="912" alt="image" src="https://user-images.githubusercontent.com/32808678/227854492-fe9f9560-8174-4b9c-9235-992e1de85fee.png">
   <img width="894" alt="image" src="https://user-images.githubusercontent.com/32808678/227854554-174f0564-25d6-41bf-9833-5214f82afdb4.png">
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@inlong.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [inlong] gong commented on pull request #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch

Posted by "gong (via GitHub)" <gi...@apache.org>.
gong commented on PR #7659:
URL: https://github.com/apache/inlong/pull/7659#issuecomment-1484848529

   You need to rebase onto master.




[GitHub] [inlong] yunqingmoswu commented on a diff in pull request #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch

Posted by "yunqingmoswu (via GitHub)" <gi...@apache.org>.
yunqingmoswu commented on code in PR #7659:
URL: https://github.com/apache/inlong/pull/7659#discussion_r1149007749


##########
inlong-sort/sort-connectors/elasticsearch-base/src/main/java/org/apache/inlong/sort/elasticsearch/table/MultipleElasticsearchSinkFunctionBase.java:
##########
@@ -0,0 +1,299 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.inlong.sort.elasticsearch.table;
+
+import org.apache.flink.api.common.functions.RuntimeContext;
+import org.apache.flink.api.common.serialization.SerializationSchema;
+import org.apache.flink.formats.common.TimestampFormat;
+import org.apache.flink.formats.json.JsonOptions.MapNullKeyMode;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.JsonNode;
+import org.apache.flink.table.api.TableSchema;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.formats.json.JsonRowDataSerializationSchema;
+import java.util.UUID;
+import org.apache.flink.table.types.logical.LogicalType;
+import org.apache.flink.table.types.logical.RowType;
+import org.apache.flink.util.Preconditions;
+import org.apache.inlong.sort.base.dirty.DirtySinkHelper;
+import org.apache.inlong.sort.base.dirty.DirtyType;
+import org.apache.inlong.sort.base.format.DynamicSchemaFormatFactory;
+import org.apache.inlong.sort.base.format.JsonDynamicSchemaFormat;
+import org.apache.inlong.sort.base.metric.SinkMetricData;
+import org.apache.inlong.sort.base.sink.SchemaUpdateExceptionPolicy;
+import org.apache.inlong.sort.elasticsearch.ElasticsearchSinkFunction;
+import org.apache.inlong.sort.elasticsearch.RequestIndexer;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import javax.annotation.Nullable;
+import java.nio.charset.StandardCharsets;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Objects;
+import java.util.function.Function;
+
+/**
+ * Sink function for converting upserts into Elasticsearch ActionRequests.
+ */
+public abstract class MultipleElasticsearchSinkFunctionBase<Request, ContentType>
+        implements
+        ElasticsearchSinkFunction<RowData, Request> {
+
+    private static final long serialVersionUID = 1L;
+
+    private static final Logger LOGGER = LoggerFactory.getLogger(ElasticsearchSinkFunctionBase.class);
+
+    private final String docType;
+    private final ContentType contentType;
+    private final RequestFactory<Request, ContentType> requestFactory;
+    private final Function<RowData, String> createKey;
+    private final Function<RowData, String> createRouting;
+    private final DirtySinkHelper<Object> dirtySinkHelper;
+    private final String multipleFormat;
+    private final String indexPattern;
+    private final TableSchemaFactory tableSchemaFactory;
+    // initialized and reserved for a later feature.
+    private final SchemaUpdateExceptionPolicy schemaUpdateExceptionPolicy;
+    // open and store an index generator for each new index.
+    private Map<String, IndexGenerator> indexGeneratorMap;
+    // table level metrics
+    private SinkMetricData sinkMetricData;
+    private transient JsonDynamicSchemaFormat jsonDynamicSchemaFormat;
+    private transient SerializationSchema<RowData> serializationSchema;
+
+    public MultipleElasticsearchSinkFunctionBase(
+            @Nullable String docType, // this is deprecated in es 7+
+            SerializationSchema<RowData> serializationSchema,
+            ContentType contentType,
+            RequestFactory<Request, ContentType> requestFactory,
+            Function<RowData, String> createKey,
+            @Nullable Function<RowData, String> createRouting,
+            DirtySinkHelper<Object> dirtySinkHelper,
+            TableSchemaFactory tableSchemaFactory,
+            String multipleFormat,
+            String indexPattern,
+            SchemaUpdateExceptionPolicy schemaUpdateExceptionPolicy) {
+        this.docType = docType;
+        this.serializationSchema = Preconditions.checkNotNull(serializationSchema);
+        this.contentType = Preconditions.checkNotNull(contentType);
+        this.requestFactory = Preconditions.checkNotNull(requestFactory);
+        this.createKey = Preconditions.checkNotNull(createKey);
+        this.createRouting = createRouting;
+        this.dirtySinkHelper = dirtySinkHelper;
+        this.tableSchemaFactory = tableSchemaFactory;
+        this.multipleFormat = multipleFormat;
+        this.indexPattern = indexPattern;
+        this.schemaUpdateExceptionPolicy = schemaUpdateExceptionPolicy;
+    }
+
+    @Override
+    public void open(RuntimeContext ctx, SinkMetricData sinkMetricData) {
+        indexGeneratorMap = new HashMap<>();
+        this.sinkMetricData = sinkMetricData;
+    }
+
+    private void sendMetrics(byte[] document) {
+        if (sinkMetricData != null) {
+            sinkMetricData.invoke(1, document.length);
+        }
+    }
+
+    @Override
+    public void initializeState(FunctionInitializationContext context) {
+    }
+
+    @Override
+    public void snapshotState(FunctionSnapshotContext context) {
+    }
+
+    @Override
+    public void process(RowData element, RuntimeContext ctx, RequestIndexer<Request> indexer) {
+        JsonNode rootNode = null;
+        // parse rootnode
+        try {
+            jsonDynamicSchemaFormat =
+                    (JsonDynamicSchemaFormat) DynamicSchemaFormatFactory.getFormat(multipleFormat);
+            rootNode = jsonDynamicSchemaFormat.deserialize(element.getBinary(0));
+            // Ignore ddl change for now
+            boolean isDDL = jsonDynamicSchemaFormat.extractDDLFlag(rootNode);
+            if (isDDL) {
+                throw new IllegalArgumentException("ddl change unsupported");

Review Comment:
   Maybe it would be better to ignore the DDL change here instead of throwing an exception?
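
One way to read this suggestion is to log the DDL record and return early, so a single schema change does not fail the job. The sketch below shows that shape with stand-in types. `DdlSkippingProcessor` and `ChangeRecord` are hypothetical names, not InLong classes, and `java.util.logging` is used only to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch: skip DDL records instead of failing the whole job. */
final class DdlSkippingProcessor {

    private static final Logger LOG = Logger.getLogger(DdlSkippingProcessor.class.getName());

    /** Minimal stand-in for a parsed change record; not an InLong type. */
    static final class ChangeRecord {
        final boolean ddl;
        final String payload;

        ChangeRecord(boolean ddl, String payload) {
            this.ddl = ddl;
            this.payload = payload;
        }
    }

    private final List<String> emitted = new ArrayList<>();
    private long skippedDdlCount;

    void process(ChangeRecord record) {
        if (record.ddl) {
            // Log and move on rather than throwing IllegalArgumentException.
            skippedDdlCount++;
            LOG.warning("Ignoring unsupported DDL change: " + record.payload);
            return;
        }
        // Non-DDL records continue down the normal write path.
        emitted.add(record.payload);
    }

    List<String> emitted() {
        return emitted;
    }

    long skippedDdlCount() {
        return skippedDdlCount;
    }
}
```

Keeping a counter of skipped records makes the ignored DDL visible in metrics even though nothing is thrown.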



##########
inlong-sort/sort-connectors/elasticsearch-base/pom.xml:
##########
@@ -130,6 +130,20 @@
             <artifactId>audit-sdk</artifactId>
             <version>${project.version}</version>
         </dependency>
+        <dependency>
+            <groupId>org.elasticsearch</groupId>
+            <artifactId>elasticsearch</artifactId>
+        </dependency>
+        <dependency>
+            <groupId>org.apache.flink</groupId>
+            <artifactId>flink-connector-elasticsearch-base_2.12</artifactId>
+            <version>1.13.5</version>

Review Comment:
   Please use the version variable instead of hard-coding the version.



##########
inlong-sort/sort-core/src/test/java/org/apache/inlong/sort/parser/ESMultipleSinkTest.java:
##########
@@ -0,0 +1,125 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.inlong.sort.parser;
+
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
+import org.apache.inlong.sort.formats.common.VarBinaryFormatInfo;
+import org.apache.inlong.sort.parser.impl.FlinkSqlParser;
+import org.apache.inlong.sort.parser.result.ParseResult;
+import org.apache.inlong.sort.protocol.FieldInfo;
+import org.apache.inlong.sort.protocol.GroupInfo;
+import org.apache.inlong.sort.protocol.StreamInfo;
+import org.apache.inlong.sort.protocol.enums.KafkaScanStartupMode;
+import org.apache.inlong.sort.protocol.node.Node;
+import org.apache.inlong.sort.protocol.node.extract.KafkaExtractNode;
+import org.apache.inlong.sort.protocol.node.format.CanalJsonFormat;
+import org.apache.inlong.sort.protocol.node.format.RawFormat;
+import org.apache.inlong.sort.protocol.node.load.ElasticsearchLoadNode;
+import org.apache.inlong.sort.protocol.transformation.FieldRelation;
+import org.apache.inlong.sort.protocol.transformation.relation.NodeRelation;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Test for {@link org.apache.inlong.sort.protocol.node.load.DorisLoadNode}

Review Comment:
   Comment error: this Javadoc links to DorisLoadNode, but the test targets ElasticsearchLoadNode.



##########
inlong-sort/sort-connectors/elasticsearch-base/src/main/java/org/apache/inlong/sort/elasticsearch/table/ElasticsearchOptions.java:
##########
@@ -143,6 +144,30 @@ public class ElasticsearchOptions {
                     .withDescription(
                             "The format must produce a valid JSON document. "
                                     + "Please refer to the documentation on formats for more details.");
+    public static final ConfigOption<String> SINK_MULTIPLE_FORMAT =
+            ConfigOptions.key("sink.multiple.format")
+                    .stringType()
+                    .noDefaultValue()
+                    .withDescription(
+                            "The format of multiple sink, it represents the real format of the raw binary data");
+    public static final ConfigOption<String> SINK_MULTIPLE_INDEX_PATTERN =
+            ConfigOptions.key("sink.multiple.index-pattern")
+                    .stringType()
+                    .noDefaultValue()
+                    .withDescription("The option 'sink.multiple.table-pattern' "
+                            + "is used extract table name from the raw binary data, "
+                            + "this is only used in the multiple sink writing scenario.");
+    public static final ConfigOption<Boolean> SINK_MULTIPLE_ENABLE =

Review Comment:
   Please use the variables defined in 'Constant' of 'base' instead of redefining them here.
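
For readers unfamiliar with the `sink.multiple.index-pattern` option in the hunk above, the sketch below shows how such a pattern might be resolved into a concrete index name per record. The `${database}` and `${table}` placeholder syntax, the class name and the method name are assumptions for illustration; the PR's actual resolution logic is not shown in this thread.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: expand ${name} placeholders in an index pattern. */
final class IndexPatternSketch {

    // Matches ${name} and captures the name between the braces.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String pattern, Map<String, String> fields) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = fields.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for placeholder: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With these assumptions, `IndexPatternSketch.resolve("${database}_${table}", Map.of("database", "shop", "table", "orders"))` returns `shop_orders`.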





[GitHub] [inlong] Yizhou-Yang closed pull request #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch

Posted by "Yizhou-Yang (via GitHub)" <gi...@apache.org>.
Yizhou-Yang closed pull request #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch
URL: https://github.com/apache/inlong/pull/7659




[GitHub] [inlong] gong commented on pull request #7659: [INLONG-7581][Sort] Support multiple-sink migration for Elasticsearch

Posted by "gong (via GitHub)" <gi...@apache.org>.
gong commented on PR #7659:
URL: https://github.com/apache/inlong/pull/7659#issuecomment-1484891338

   1. git pull upstream master
   2. git checkout -b INLONG-7581





