Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/09 07:29:32 UTC

[GitHub] [hudi] codope opened a new pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

codope opened a new pull request #3433:
URL: https://github.com/apache/hudi/pull/3433


   * HUDI-1896 initial source for Cloud Dfs
   
   * update with changes, added for fileMap support HUDI-1896
   
   * update with changes, added for fileMap support HUDI-1896
   
   * s3 meta source HUDI-1896
   
   * adding hoodie cloud object source class
   
   * adding hoodie cloud object source class
   
   * [HUDI-1896] adding selector test cases
   
   * [HUDI-1896] Initial source for Cloud Dfs and test cases
   
   * [HUDI-1896] Initial source for Cloud Dfs and test cases
   
   * [HUDI-1896] Initial source for Cloud Dfs and test cases
   
   Resolve conflicts and rename opt keys
   
   Minor refactoring in CloudObjectsDfsSelector
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
    - [ ] Has a corresponding JIRA in PR title & commit
    
    - [ ] Commit message is descriptive of the change
    
    - [ ] CI is green
   
    - [ ] Necessary doc changes done or have another open PR
          
    - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   ## CI report:
   
   * 33f7d78265f1a9635d6254e0dbfb40f161a3d4a7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607) 
   * bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688588000



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject Metadata (eg. s3

Review comment:
       Done.







[GitHub] [hudi] nsivabalan commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r687820245



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject Metadata (eg. s3
+ * events data). It will use the cloud queue for receiving the object key events. This can be useful
+ * for check cloud file activity over time and consuming this to create other hoodie table from

Review comment:
       minor: this can be useful "to" check cloud file activity over time and "create the hoodie cloud meta table."
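
For reference, the class javadoc with the suggested wording folded in might read as follows (a doc-only sketch, not the text that was actually committed):

```java
/**
 * This source creates the hoodie cloud meta table from cloud object metadata (e.g. S3
 * event data). It uses the cloud queue to receive object-key events. This can be useful
 * to check cloud file activity over time and to create other hoodie tables from the
 * underlying cloud object data.
 */
```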

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject Metadata (eg. s3

Review comment:
       to avoid confusion with the hoodie metadata table in general, let's call the table we create in this two-stage pipeline the "hoodie cloud meta table". If you agree, can you please fix the terminology throughout the patch?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(

Review comment:
       minor: rename to "deleteProcessedMessages". Every caller already invokes this from within onCommit(), so it's understandable.
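
A sketch of the suggested rename, keeping the existing body; `deleteProcessedMessages` is the reviewer's proposed name rather than anything already in the codebase:

```java
/**
 * Deletes processed queue messages in batches of 10 (the SQS DeleteMessageBatch
 * limit). Invoked by callers from within onCommit().
 */
public void deleteProcessedMessages(AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
  if (!processedMessages.isEmpty()) {
    // SQS DeleteMessageBatchRequest accepts at most 10 entries per call
    List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
    for (List<Message> deleteBatch : deleteBatches) {
      deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
    }
  }
}
```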

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,

Review comment:
       Looks like jsc is not used; can we drop the parameter?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {

Review comment:
       why is this modelled as a factory? I see it is instantiated only from within the MetaSource, so why not directly call new CloudObjectsMetaSelector(TypedProperties props) from within the constructor of CloudObjectsMetaSource?
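
A minimal sketch of the alternative being suggested, assuming no pluggable selector implementations are required (otherwise the reflection-based factory would still be needed):

```java
public CloudObjectsMetaSource(
    TypedProperties props,
    JavaSparkContext sparkContext,
    SparkSession sparkSession,
    SchemaProvider schemaProvider) {
  super(props, sparkContext, sparkSession, schemaProvider);
  // Instantiate the selector directly; the reflection-based factory is only needed
  // if alternative selector implementations must be configurable.
  this.pathSelector = new CloudObjectsMetaSelector(props);
  this.sqs = this.pathSelector.createAmazonSqsClient();
}
```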

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
+
+    if (!processedMessages.isEmpty()) {
+
+      // create batch for deletion, SES DeleteMessageBatchRequest only accept max 10 entries
+      List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
+      for (List<Message> deleteBatch : deleteBatches) {
+        deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
+      }
+    }
+  }
+
+  /**
+   * Configs supported.
+   */
+  public static class Config {
+    /**
+     * {@value #QUEUE_URL_PROP} is the queue url for cloud object events.
+     */
+    public static final String QUEUE_URL_PROP = "hoodie.deltastreamer.source.queue.url";

Review comment:
       we might need to fix the naming convention of all these configs:
   "hoodie.deltastreamer.cloud.source...." or something along similar lines. Wdyt?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(

Review comment:
       may I know why we need this sorting? If Hoodie is going to do preCombine anyway, are we not duplicating the effort here?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+

Review comment:
       please avoid unnecessary line breaks.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create a hoodie table for cloud object metadata (e.g. s3
+ * events data). It will use the cloud queue for receiving the object key events. This can be useful
+ * for checking cloud file activity over time and for consuming it to create other hoodie tables from
+ * cloud object data.
+ */
+public class CloudObjectsMetaSource extends RowSource {
+
+  private final CloudObjectsMetaSelector pathSelector;
+  private final List<Message> processedMessages = new ArrayList<>();
+  AmazonSQS sqs;
+
+  /**
+   * Cloud Objects Meta Source Class.
+   */
+  public CloudObjectsMetaSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+    this.pathSelector = CloudObjectsMetaSelector.createSourceSelector(props);
+    this.sqs = this.pathSelector.createAmazonSqsClient();
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    Pair<List<String>, String> selectPathsWithLatestSqsMessage =

Review comment:
       Can you add a javadoc here to explain what the components of this pair are?
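
       For illustration, a comment along these lines could make the pair self-describing (a sketch;
       the wording is my reading of the selector's javadoc, not text from the patch):

           // Left  : eligible S3 event records pulled from the SQS queue, serialized as JSON strings.
           // Right : the new checkpoint string (latest event time seen) handed back to DeltaStreamer.
           Pair<List<String>, String> selectPathsWithLatestSqsMessage =
               pathSelector.getNextEventsFromQueue(sqs, sparkContext, lastCkptStr, processedMessages);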

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create a hoodie table for cloud object metadata (e.g. s3
+ * events data). It will use the cloud queue for receiving the object key events. This can be useful
+ * for checking cloud file activity over time and for consuming it to create other hoodie tables from
+ * cloud object data.
+ */
+public class CloudObjectsMetaSource extends RowSource {
+
+  private final CloudObjectsMetaSelector pathSelector;
+  private final List<Message> processedMessages = new ArrayList<>();
+  AmazonSQS sqs;
+
+  /**
+   * Cloud Objects Meta Source Class.
+   */
+  public CloudObjectsMetaSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+    this.pathSelector = CloudObjectsMetaSelector.createSourceSelector(props);
+    this.sqs = this.pathSelector.createAmazonSqsClient();
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    Pair<List<String>, String> selectPathsWithLatestSqsMessage =
+        pathSelector.getNextEventsFromQueue(sqs, sparkContext, lastCkptStr, processedMessages);
+    if (selectPathsWithLatestSqsMessage.getLeft().isEmpty()) {
+      return Pair.of(Option.empty(), selectPathsWithLatestSqsMessage.getRight());
+    } else {
+      return Pair.of(
+          Option.of(fromEventRecords(selectPathsWithLatestSqsMessage.getLeft())),
+          selectPathsWithLatestSqsMessage.getRight());
+    }
+  }
+
+  private Dataset<Row> fromEventRecords(List<String> jsonData) {

Review comment:
       If it is not going to be re-used, maybe we can just inline these 2 lines around line 72 or so.
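
       The inlined version could look roughly like this (a sketch; the body of fromEventRecords is
       not visible in this hunk, so the json-read below is an assumption based on the Encoders
       import):

           Dataset<String> eventRecords =
               sparkSession.createDataset(selectPathsWithLatestSqsMessage.getLeft(), Encoders.STRING());
           return Pair.of(Option.of(sparkSession.read().json(eventRecords)),
               selectPathsWithLatestSqsMessage.getRight());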

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);

Review comment:
       If parsing fails, why set it to MIN_VALUE? A negative value does not make sense. At the least we should set it to 0. Or, if it's an epoch timestamp, we should set it to the earliest time in epoch (1970/01/01...)
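
       A minimal sketch of that suggestion (0L corresponds to 1970-01-01T00:00:00Z in epoch millis):

           long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(0L);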




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688590383



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);

Review comment:
       Good catch. Done away with this. Also introduced a check which will add filePath to cloudFiles only if it exists.
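
       A sketch of such an existence check (the FSUtils/FileSystem/Path usage, the extra imports and
       the exception handling shown here are my assumptions, not the exact committed code):

           try {
             // Skip S3 keys that no longer exist by the time this batch runs.
             FileSystem fs = FSUtils.getFs(filePath, sparkContext.hadoopConfiguration());
             if (fs.exists(new Path(filePath))) {
               cloudFiles.add(filePath);
             }
           } catch (IOException e) {
             LOG.warn("Skipping " + filePath + " since its existence could not be verified", e);
           }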




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r687987187



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")

Review comment:
       can we declare constants for these.
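
       A sketch of what such constants could look like (names are illustrative, not from the patch):

           private static final String S3_SIZE_FILTER = "s3.object.size > 0";
           private static final String S3_BUCKET_NAME_COLUMN = "s3.bucket.name";
           private static final String S3_OBJECT_KEY_COLUMN = "s3.object.key";

           // usage at the call site
           source.filter(S3_SIZE_FILTER)
               .select(S3_BUCKET_NAME_COLUMN, S3_OBJECT_KEY_COLUMN)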

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);
+
+    return Pair.of(Option.of(fromFiles(pathStr)), instantEndpts.getRight());
+  }
+
+  /**
+   * Function to create Dataset from parquet files.
+   */
+  private Dataset<Row> fromFiles(String pathStr) {
+    return sparkSession.read().parquet(pathStr.split(","));

Review comment:
       Not sure we really need a method for one line that is called only by one caller.
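
       Inlined, the call site could read (a sketch under the same assumption that the helper stays
       single-use):

           return Pair.of(
               Option.of(sparkSession.read().parquet(pathStr.split(","))),
               instantEndpts.getRight());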

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")

Review comment:
       Can you help me understand how exactly deletes in S3 are handled by these 2 sources?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =

Review comment:
       minor: hoodieCloudMetaReader

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");
+    if (!deleteFailures.isEmpty()) {
+      log.warn(

Review comment:
       If deletes aren't succeeding, we might keep going in a loop, right?
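
       One way to avoid silently re-reading the same messages on every batch would be to fail fast
       (a sketch, not the author's fix; HoodieException would need to be imported in this class):

           if (!deleteFailures.isEmpty()) {
             throw new HoodieException("Failed to delete " + deleteFailures.size()
                 + " messages from SQS queue " + queueUrl);
           }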

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");

Review comment:
       remove SOPs
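
       A sketch of the replacement for the System.out.println above, routing the same information
       through the class logger instead of stdout (the existing log.warn on failures would stay):

           log.debug("Delete succeeded for all messages: " + deleteFailures.isEmpty());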

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+

Review comment:
       Too many line breaks here; can we please fix that?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,

Review comment:
       maxMessagePerBatch

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}.This source will use
+ * the cloud files meta information form cloud meta hoodie table generate by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);
+
+    return Pair.of(Option.of(fromFiles(pathStr)), instantEndpts.getRight());
+  }
+
+  /**
+   * Function to create Dataset from parquet files.
+   */
+  private Dataset<Row> fromFiles(String pathStr) {
+    return sparkSession.read().parquet(pathStr.split(","));

Review comment:
       If we keep it as read().format(), we can support any file format, right? Add another config to this source and make "parquet" the default.
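       A rough sketch of that idea, assuming a new (hypothetical) property name and
       keeping "parquet" as the default; this is not the actual PR code:

           // hypothetical config key; default stays parquet
           String fileFormat = props.getString(
               "hoodie.deltastreamer.source.cloud.datafile.format", "parquet");
           // format() lets the same source read any Spark datasource format
           return sparkSession.read().format(fileFormat).load(pathStr.split(","));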

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}.This source will use
+ * the cloud files meta information form cloud meta hoodie table generate by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);
+
+    return Pair.of(Option.of(fromFiles(pathStr)), instantEndpts.getRight());
+  }
+
+  /**
+   * Function to create Dataset from parquet files.
+   */
+  private Dataset<Row> fromFiles(String pathStr) {
+    return sparkSession.read().parquet(pathStr.split(","));

Review comment:
       What happens if one of the files is non-existent (deleted)? Will the read silently ignore it and fetch the rest?
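       For context: with a plain parquet read, Spark normally fails the batch with a
       FileNotFoundException when a listed file has since been deleted, unless missing
       files are explicitly ignored. A hedged sketch of one possible guard, assuming the
       standard spark.sql.files.ignoreMissingFiles session conf is acceptable here:

           // assumption: skip files deleted between event capture and ingestion
           sparkSession.conf().set("spark.sql.files.ignoreMissingFiles", "true");
           Dataset<Row> rows = sparkSession.read().parquet(pathStr.split(","));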

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {

Review comment:
       maxMessagePerRequest

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());

Review comment:
       Let's avoid any (debug) logs at a per-message level.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));

Review comment:
       Declare a constant for "ApproximateNumberOfMessages" and reuse it.
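       A minimal sketch of the suggestion (the field name is illustrative only):

           // single definition, reused by getSqsQueueAttributes() and getMessagesToProcess()
           private static final String SQS_ATTR_APPROX_MESSAGES = "ApproximateNumberOfMessages";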

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
+
+    if (!processedMessages.isEmpty()) {
+
+      // create batch for deletion, SES DeleteMessageBatchRequest only accept max 10 entries
+      List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
+      for (List<Message> deleteBatch : deleteBatches) {
+        deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
+      }
+    }
+  }
+
+  /**
+   * Configs supported.
+   */
+  public static class Config {
+    /**
+     * {@value #QUEUE_URL_PROP} is the queue url for cloud object events.
+     */
+    public static final String QUEUE_URL_PROP = "hoodie.deltastreamer.source.queue.url";
+
+    /**
+     * {@value #QUEUE_REGION} is the case-sensitive region name of the cloud provider for the queue. For example, "us-east-1".
+     */
+    public static final String QUEUE_REGION = "hoodie.deltastreamer.source.queue.region";
+
+    /**
+     * {@value #SOURCE_QUEUE_FS_PROP} is file system corresponding to queue. For example, for AWS SQS it is s3/s3a.
+     */
+    public static final String SOURCE_QUEUE_FS_PROP = "hoodie.deltastreamer.source.queue.fs";
+
+    /**
+     * {@value #QUEUE_LONGPOLLWAIT_PROP} is the long poll wait time in seconds If set as 0 then
+     * client will fetch on short poll basis.
+     */
+    public static final String QUEUE_LONGPOLLWAIT_PROP =

Review comment:
       QUEUE_LONG_POLL_WAIT_PROP

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * Create partitions of a list using the given batch size. We can't use a third-party API for
+   * this functionality due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * Delete a batch of messages from the queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    log.debug("Delete failures (empty means all deletions succeeded): " + deleteFailures);
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
+
+    if (!processedMessages.isEmpty()) {
+
+      // create batches for deletion; SQS DeleteMessageBatchRequest accepts at most 10 entries
+      List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
+      for (List<Message> deleteBatch : deleteBatches) {
+        deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
+      }
+    }
+  }
+
+  /**
+   * Configs supported.
+   */
+  public static class Config {
+    /**
+     * {@value #QUEUE_URL_PROP} is the queue url for cloud object events.
+     */
+    public static final String QUEUE_URL_PROP = "hoodie.deltastreamer.source.queue.url";
+
+    /**
+     * {@value #QUEUE_REGION} is the case-sensitive region name of the cloud provider for the queue. For example, "us-east-1".
+     */
+    public static final String QUEUE_REGION = "hoodie.deltastreamer.source.queue.region";
+
+    /**
+     * {@value #SOURCE_QUEUE_FS_PROP} is the file system corresponding to the queue's events. For example, for AWS S3 events it is s3/s3a.
+     */
+    public static final String SOURCE_QUEUE_FS_PROP = "hoodie.deltastreamer.source.queue.fs";
+
+    /**
+     * {@value #QUEUE_LONGPOLLWAIT_PROP} is the long poll wait time in seconds. If set to 0, the
+     * client falls back to short polling.
+     */
+    public static final String QUEUE_LONGPOLLWAIT_PROP =
+        "hoodie.deltastreamer.source.queue.longpoll.wait";
+
+    /**
+     * {@value #QUEUE_MAXMESSAGESEACHBATCH_PROP} is the maximum number of messages per batch of a
+     * DeltaStreamer run. The source will process at most this many messages at a time.
+     */
+    public static final String QUEUE_MAXMESSAGESEACHBATCH_PROP =
+        "hoodie.deltastreamer.source.queue.max.messages.eachbatch";
+
+    /**
+     * {@value #QUEUE_VISIBILITYTIMEOUT_PROP} is the visibility timeout (in seconds) for messages
+     * in the queue. After a message is consumed, the queue moves it to the in-flight state, and it
+     * cannot be consumed again by the source for this timeout period.
+     */
+    public static final String QUEUE_VISIBILITYTIMEOUT_PROP =

Review comment:
       QUEUE_VISIBILITY_TIMEOUT_PROP
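
       For context, the Config keys documented in the diff above can be wired up roughly like
       this (a sketch, not the merged code; the queue URL is a hypothetical placeholder and the
       truncated visibility-timeout key is omitted):

           TypedProperties props = new TypedProperties();
           // required by the CloudObjectsSelector constructor
           props.setProperty("hoodie.deltastreamer.source.queue.url",
               "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events-queue"); // hypothetical
           props.setProperty("hoodie.deltastreamer.source.queue.region", "us-east-1");
           // optional overrides (constructor defaults: s3, 20, 5)
           props.setProperty("hoodie.deltastreamer.source.queue.fs", "s3a");
           props.setProperty("hoodie.deltastreamer.source.queue.longpoll.wait", "20");
           props.setProperty("hoodie.deltastreamer.source.queue.max.messages.eachbatch", "5");
           CloudObjectsSelector selector = new CloudObjectsSelector(props);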

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queues.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Instantiates a {@link CloudObjectsSelector} from the given properties.
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWS SQS client
+   * @param queueUrl  full URL of the queue
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record JSON record of the object event
+   * @return map of file attributes
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * Create partitions of a list using the given batch size. We can't use a third-party API for
+   * this functionality due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * Delete a batch of messages from the queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    log.debug("Delete failures (empty means all deletions succeeded): " + deleteFailures);
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
+
+    if (!processedMessages.isEmpty()) {
+
+      // create batches for deletion; SQS DeleteMessageBatchRequest accepts at most 10 entries
+      List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
+      for (List<Message> deleteBatch : deleteBatches) {
+        deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
+      }
+    }
+  }
+
+  /**
+   * Configs supported.
+   */
+  public static class Config {
+    /**
+     * {@value #QUEUE_URL_PROP} is the queue url for cloud object events.
+     */
+    public static final String QUEUE_URL_PROP = "hoodie.deltastreamer.source.queue.url";
+
+    /**
+     * {@value #QUEUE_REGION} is the case-sensitive region name of the cloud provider for the queue. For example, "us-east-1".
+     */
+    public static final String QUEUE_REGION = "hoodie.deltastreamer.source.queue.region";
+
+    /**
+     * {@value #SOURCE_QUEUE_FS_PROP} is the file system corresponding to the queue's events. For example, for AWS S3 events it is s3/s3a.
+     */
+    public static final String SOURCE_QUEUE_FS_PROP = "hoodie.deltastreamer.source.queue.fs";
+
+    /**
+     * {@value #QUEUE_LONGPOLLWAIT_PROP} is the long poll wait time in seconds. If set to 0, the
+     * client falls back to short polling.
+     */
+    public static final String QUEUE_LONGPOLLWAIT_PROP =
+        "hoodie.deltastreamer.source.queue.longpoll.wait";
+
+    /**
+     * {@value #QUEUE_MAXMESSAGESEACHBATCH_PROP} is the maximum number of messages per batch of a
+     * DeltaStreamer run. The source will process at most this many messages at a time.
+     */
+    public static final String QUEUE_MAXMESSAGESEACHBATCH_PROP =

Review comment:
       QUEUE_MAX_MESSAGE_PER_BATCH_PROP
   "hoodie.deltastreamer.cloud.source.queue.max.messages.per.batch"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688589429



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class provides the methods to process the messages
+ * from the queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from the queue, filtering out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages list that accumulates the messages whose records are eligible
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = true;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = false;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(
+              record ->
+                  Date.from(
+                          Instant.from(
+                              DateTimeFormatter.ISO_INSTANT.parse(
+                                  (String) record.get("eventTime"))))
+                      .getTime()));
+
+      List<String> filteredEventRecords = new ArrayList<>();
+      long newCheckpointTime = lastCheckpointTime;
+
+      for (Map<String, Object> eventRecord : eligibleEventRecords) {
+        newCheckpointTime =

Review comment:
       Yeah. Removed sort and setting the next checkpoint outside loop.
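
       A sketch of computing the next checkpoint once, after collecting the eligible events,
       instead of sorting and updating it inside the loop (assumes eventTime is the ISO-8601
       string from the S3 event payload; the merged change may differ):

           long newCheckpointTime = eligibleEventRecords.stream()
               .mapToLong(record -> Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(
                   (String) record.get("eventTime")))).getTime())
               .max()
               .orElse(lastCheckpointTime);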




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 33f7d78265f1a9635d6254e0dbfb40f161a3d4a7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607) 
   * bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688264818



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class provides the methods to process the messages
+ * from the queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from the queue, filtering out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages list that accumulates the messages whose records are eligible
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(

Review comment:
       Makes sense. Going with validEvents as records sounds too generic.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1740",
       "triggerID" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5ffdc82b5e08f83772f28c2bf844688bc3e9fc50 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1740) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206707



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class provides the methods to process the messages
+ * from the queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from the queue, filtering out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages list that accumulates the messages whose records are eligible
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = true;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = false;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(

Review comment:
       Valid point! We don't need this here.
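
       For readers following the thread, a hedged sketch of the per-event checkpoint filter being
       discussed (assumes events at or before the last checkpoint are skipped and that records are
       re-serialized to JSON strings; the merged logic may differ):

           for (Map<String, Object> eventRecord : eligibleEventRecords) {
             long eventTime = Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(
                 (String) eventRecord.get("eventTime")))).getTime();
             if (eventTime > lastCheckpointTime) {
               filteredEventRecords.add(new JSONObject(eventRecord).toString());
             }
           }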




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688591036



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source uses
+ * the cloud file metadata from the cloud meta Hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);
+
+    return Pair.of(Option.of(fromFiles(pathStr)), instantEndpts.getRight());
+  }
+
+  /**
+   * Function to create Dataset from parquet files.
+   */
+  private Dataset<Row> fromFiles(String pathStr) {
+    return sparkSession.read().parquet(pathStr.split(","));

Review comment:
       Introduced a format config and added a condition which checks that file exists before fetching.
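
       As a rough illustration of that change (hypothetical config key, not the merged code;
       requires org.apache.hadoop.fs.FileSystem/Path imports):

           // hypothetical key name; the actual config introduced in the PR may differ
           String fileFormat = props.getString("hoodie.deltastreamer.source.cloud.datafile.format", "parquet");
           List<String> existingFiles = new ArrayList<>();
           for (String filePath : cloudFiles) {
             try {
               Path path = new Path(filePath);
               FileSystem fs = path.getFileSystem(sparkContext.hadoopConfiguration());
               if (fs.exists(path)) {
                 existingFiles.add(filePath);
               }
             } catch (IOException e) {
               LOG.warn("Skipping " + filePath + " due to " + e.getMessage());
             }
           }
           Dataset<Row> source = sparkSession.read().format(fileFormat)
               .load(existingFiles.toArray(new String[0]));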




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688220583



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source uses
+ * the cloud file metadata from the cloud meta Hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {

Review comment:
       again, does the schema work in general for any cloud store? if not, we can call this just S3EventsHoodieIncrSource or sth

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsDfsSource.java
##########
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsDfsSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create a Hoodie table from cloud object data (e.g. S3 events).
+ * It primarily uses a cloud queue to fetch new object information and updates the Hoodie table with
+ * the cloud object data.
+ */
+public class CloudObjectsDfsSource extends RowSource {

Review comment:
       lets call this `S3EventSource` or `S3ActivitySource`? something specific to S3? It does not work with cloud object stores in general or sth right

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source uses
+ * the cloud file metadata from the cloud meta Hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;

Review comment:
       Pull the `s3://` prefix etc. into a string constant?
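       A minimal sketch of that suggestion (class and constant names are illustrative, not from the PR):

       ```java
       // Illustrative only: pulls the scheme prefix out of the inline literal.
       public class S3PathSketch {
         private static final String S3_SCHEME_PREFIX = "s3://";

         // builds "s3://<bucket>/<key>" without repeating the scheme literal at each call site
         static String toS3Path(String bucket, String key) {
           return S3_SCHEME_PREFIX + bucket + "/" + key;
         }
       }
       ```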

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incremental Source. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud files' meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =

Review comment:
       I think `metaReader` is okay  :)  others are implicit from context

##########
File path: hudi-utilities/pom.xml
##########
@@ -402,6 +402,14 @@
       <scope>test</scope>
     </dependency>
 
+    <!-- AWS Services -->
+    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs -->
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk-sqs</artifactId>
+      <version>1.12.22</version>

Review comment:
       Is this pretty much independent of Hadoop versions and such?

##########
File path: hudi-utilities/pom.xml
##########
@@ -402,6 +402,14 @@
       <scope>test</scope>
     </dependency>
 
+    <!-- AWS Services -->
+    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs -->
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk-sqs</artifactId>
+      <version>1.12.22</version>

Review comment:
       Can we pull the version into a property, so it can be overridden if needed?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject Metadata (eg. s3

Review comment:
       lets please avoid "cloud". its a very broad term. cloud != object storage. ceph is an obj store. minio is an obj store. nothing to do with clould. again, if this will work only with s3. lets just call it that?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsDfsSource.java
##########
@@ -0,0 +1,87 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsDfsSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides the capability to create a hoodie table from cloud object data (e.g. S3 events).
+ * It primarily uses a cloud queue to fetch new object information and updates the hoodie table with
+ * the cloud object data.
+ */
+public class CloudObjectsDfsSource extends RowSource {
+
+  private final CloudObjectsDfsSelector pathSelector;

Review comment:
       same with this class




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ee8fbced2d229bd487794a19123c47417acbf306 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712) 
   * ee8fbced2d229bd487794a19123c47417acbf306 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688589088



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class provides the methods to process the messages
+ * from the queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from the queue and filter out ineligible events while doing so. It will also delete
+   * the ineligible messages from the queue.
+   *
+   * @param processedMessages list of already-processed messages, to which newly processed messages are added
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = true;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = false;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(
+              record ->
+                  Date.from(
+                          Instant.from(
+                              DateTimeFormatter.ISO_INSTANT.parse(
+                                  (String) record.get("eventTime"))))
+                      .getTime()));
+
+      List<String> filteredEventRecords = new ArrayList<>();
+      long newCheckpointTime = lastCheckpointTime;
+
+      for (Map<String, Object> eventRecord : eligibleEventRecords) {
+        newCheckpointTime =
+            Date.from(
+                    Instant.from(
+                        DateTimeFormatter.ISO_INSTANT.parse((String) eventRecord.get("eventTime"))))
+                .getTime();
+
+        // Currently Hudi doesn't support column names like request-amz-id-2
+        eventRecord.remove("responseElements");

Review comment:
       Moved to `getValidEvents`.
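       For reference, a minimal, self-contained sketch of the SNS-vs-direct unwrapping done in the quoted selector above (class and method names are illustrative, not from the PR; the real bodies come from SQS):

       ```java
       import com.fasterxml.jackson.databind.ObjectMapper;
       import java.util.List;
       import java.util.Map;
       import org.json.JSONObject;

       public class S3EventBodySketch {
         @SuppressWarnings("unchecked")
         static List<Map<String, Object>> extractRecords(String sqsBody) throws Exception {
           ObjectMapper mapper = new ObjectMapper();
           JSONObject body = new JSONObject(sqsBody);
           // S3 -> SNS -> SQS wraps the S3 notification JSON inside a "Message" field;
           // S3 -> SQS delivers the notification JSON directly as the message body.
           String s3Json = body.has("Message") ? body.getString("Message") : body.toString();
           Map<String, Object> notification = (Map<String, Object>) mapper.readValue(s3Json, Map.class);
           // each entry in "Records" carries eventName, eventTime and the s3 bucket/object details
           return (List<Map<String, Object>>) notification.get("Records");
         }
       }
       ```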




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590) 
   * 33f7d78265f1a9635d6254e0dbfb40f161a3d4a7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590) 
   * 33f7d78265f1a9635d6254e0dbfb40f161a3d4a7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712) 
   * ee8fbced2d229bd487794a19123c47417acbf306 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1740",
       "triggerID" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ee8fbced2d229bd487794a19123c47417acbf306 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719) 
   * 5ffdc82b5e08f83772f28c2bf844688bc3e9fc50 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1740) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895681502


   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-898823714


   https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=1719&view=results passed. So this can land per se


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan merged pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
nsivabalan merged pull request #3433:
URL: https://github.com/apache/hudi/pull/3433


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     } ]
   }-->
   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688206771



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    System.out.println("Delete is" + deleteFailures.isEmpty() + "or ignoring it.");
+    if (!deleteFailures.isEmpty()) {
+      log.warn(
+          "Failed to delete "
+              + deleteFailures.size()
+              + " messages out of "
+              + deleteEntries.size()
+              + " from queue.");
+    } else {
+      log.info("Successfully deleted " + deleteEntries.size() + " messages from queue.");
+    }
+  }
+
+  /**
+   * Delete Queue Messages after hudi commit. This method will be invoked by source.onCommit.
+   */
+  public void onCommitDeleteProcessedMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
+
+    if (!processedMessages.isEmpty()) {
+
+      // create batches for deletion; SQS DeleteMessageBatchRequest only accepts a max of 10 entries
+      List<List<Message>> deleteBatches = createListPartitions(processedMessages, 10);
+      for (List<Message> deleteBatch : deleteBatches) {
+        deleteBatchOfMessages(sqs, queueUrl, deleteBatch);
+      }
+    }
+  }
+
+  /**
+   * Configs supported.
+   */
+  public static class Config {
+    /**
+     * {@value #QUEUE_URL_PROP} is the queue url for cloud object events.
+     */
+    public static final String QUEUE_URL_PROP = "hoodie.deltastreamer.source.queue.url";

Review comment:
       Sounds good. I'll change it.
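       As a hedged usage sketch (not part of the PR; it assumes the `Config` constants are publicly accessible and uses placeholder queue values), wiring the selector up from `TypedProperties` looks roughly like this:

       ```java
       import com.amazonaws.services.sqs.AmazonSQS;
       import org.apache.hudi.common.config.TypedProperties;
       import org.apache.hudi.utilities.sources.helpers.CloudObjectsSelector;

       public class SelectorWiringSketch {
         public static void main(String[] args) {
           TypedProperties props = new TypedProperties();
           // reference the keys via the Config constants, since the literal names may change per this review
           props.setProperty(CloudObjectsSelector.Config.QUEUE_URL_PROP,
               "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events-queue"); // placeholder
           props.setProperty(CloudObjectsSelector.Config.QUEUE_REGION, "us-east-1"); // placeholder

           CloudObjectsSelector selector = new CloudObjectsSelector(props);
           AmazonSQS sqs = selector.createAmazonSqsClient(); // client built for the configured region
         }
       }
       ```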




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688864122



##########
File path: hudi-utilities/pom.xml
##########
@@ -402,6 +402,14 @@
       <scope>test</scope>
     </dependency>
 
+    <!-- AWS Services -->
+    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs -->
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk-sqs</artifactId>
+      <version>${aws.sdk.version}</version>

Review comment:
       We are not bundling this, so we should make sure to document passing `--jars` to add it for this to work at runtime. cc @codope




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688484043



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");
+
+    for (int i = 0;
+         i < (int) Math.ceil((double) approxMessagesAvailable / maxMessagesEachRequest);
+         ++i) {
+      List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
+      log.debug("Messages size: " + messages.size());
+
+      for (Message message : messages) {
+        log.debug("message id: " + message.getMessageId());
+        messagesToProcess.add(message);
+      }
+      log.debug("total fetched messages size: " + messagesToProcess.size());
+      if (messages.isEmpty() || (messagesToProcess.size() >= maxMessageEachBatch)) {
+        break;
+      }
+    }
+    return messagesToProcess;
+  }
+
+  /**
+   * create partitions of list using specific batch size. we can't use third party API for this
+   * functionality, due to https://github.com/apache/hudi/blob/master/style/checkstyle.xml#L270
+   */
+  protected List<List<Message>> createListPartitions(List<Message> singleList, int eachBatchSize) {
+    List<List<Message>> listPartitions = new ArrayList<>();
+
+    if (singleList.size() == 0 || eachBatchSize < 1) {
+      return listPartitions;
+    }
+
+    for (int start = 0; start < singleList.size(); start += eachBatchSize) {
+      int end = Math.min(start + eachBatchSize, singleList.size());
+
+      if (start > end) {
+        throw new IndexOutOfBoundsException(
+            "Index " + start + " is out of the list range <0," + (singleList.size() - 1) + ">");
+      }
+      listPartitions.add(new ArrayList<>(singleList.subList(start, end)));
+    }
+    return listPartitions;
+  }
+
+  /**
+   * delete batch of messages from queue.
+   */
+  protected void deleteBatchOfMessages(
+      AmazonSQS sqs, String queueUrl, List<Message> messagesToBeDeleted) {
+    DeleteMessageBatchRequest deleteBatchReq =
+        new DeleteMessageBatchRequest().withQueueUrl(queueUrl);
+    List<DeleteMessageBatchRequestEntry> deleteEntries = deleteBatchReq.getEntries();
+
+    for (Message message : messagesToBeDeleted) {
+      deleteEntries.add(
+          new DeleteMessageBatchRequestEntry()
+              .withId(message.getMessageId())
+              .withReceiptHandle(message.getReceiptHandle()));
+    }
+    DeleteMessageBatchResult deleteResult = sqs.deleteMessageBatch(deleteBatchReq);
+    List<String> deleteFailures =
+        deleteResult.getFailed().stream()
+            .map(BatchResultErrorEntry::getId)
+            .collect(Collectors.toList());
+    log.debug("All messages deleted successfully: " + deleteFailures.isEmpty());
+    if (!deleteFailures.isEmpty()) {
+      log.warn(

Review comment:
       Yes. These delete failures are due to causes outside our control, e.g. a transient failure on the AWS side, so the assumption is that those messages will eventually get deleted. However, I think we should track whether we are processing new messages; we will need to figure out a way to do that, either using hoodie stats or some other mechanism. Satish or I can take it up as a follow-up.
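   For illustration, a minimal sketch of how the failed deletes could at least be surfaced for monitoring until that follow-up lands; it assumes it runs right after the existing `sqs.deleteMessageBatch(deleteBatchReq)` call, reusing `deleteResult` and the class logger, and is not part of this PR:
   ```java
   // Hedged sketch: log each failed delete with its error code so a stuck queue
   // shows up in the logs. The message becomes visible again after the visibility
   // timeout and is retried on a later fetch.
   for (BatchResultErrorEntry failed : deleteResult.getFailed()) {
     log.warn("Could not delete SQS message " + failed.getId()
         + " (code=" + failed.getCode() + ", senderFault=" + failed.getSenderFault() + "): "
         + failed.getMessage());
   }
   ```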







[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688893849



##########
File path: hudi-utilities/pom.xml
##########
@@ -402,6 +402,14 @@
       <scope>test</scope>
     </dependency>
 
+    <!-- AWS Services -->
+    <!-- https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-sqs -->
+    <dependency>
+      <groupId>com.amazonaws</groupId>
+      <artifactId>aws-java-sdk-sqs</artifactId>
+      <version>${aws.sdk.version}</version>

Review comment:
       Ack.







[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 076123328724c1ef5051208c57706ae09ba6c11e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688593626



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud file meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")

Review comment:
       We are not handling deletes right now; that will need some work. I was thinking of capturing the delete events and adding a column, e.g. is_deleted, to the event meta table. Satish or I can take it up as a follow-up task.
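   For context, a rough sketch of what that follow-up could look like inside the record-filtering loop of the meta selector; the is_deleted field and its placement are assumptions, not part of this PR:
   ```java
   // Hedged sketch: also keep S3 delete notifications and tag every record, so the
   // downstream incremental source can filter on is_deleted. ALLOWED_S3_EVENT_PREFIX
   // is the existing ("ObjectCreated") list from CloudObjectsSelector.
   String eventName = (String) record.get("eventName");
   if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
     record.put("is_deleted", false);                  // ObjectCreated:* events
     eligibleRecords.add(record);
   } else if (eventName.startsWith("ObjectRemoved")) {
     record.put("is_deleted", true);                   // ObjectRemoved:* events
     eligibleRecords.add(record);
   }
   ```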







[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688202916



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {

Review comment:
       It is being used in `CloudObjectsDfsSource`. Now that we don't need that source, its usage is limited to `CloudObjectsMetaSource` only. However, I think it's better to keep it as a static factory method: a) the semantics stay in line with DFSPathSelector, and b) it could be useful in the future as we add sources for more cloud providers.
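   To illustrate the extensibility point, a custom selector for another provider could then be plugged in purely through configuration; the class below is hypothetical and only shows the wiring that createSourceSelector(...) expects, under the assumption that Config.SOURCE_INPUT_SELECTOR is visible to the caller:
   ```java
   import org.apache.hudi.common.config.TypedProperties;
   import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;

   // Hypothetical selector for another cloud provider. Setting the
   // Config.SOURCE_INPUT_SELECTOR property to this class name (plus the queue
   // URL/region the base class requires) would make
   // CloudObjectsMetaSelector.createSourceSelector(props) load it reflectively.
   public class MyCloudMetaSelector extends CloudObjectsMetaSelector {
     public MyCloudMetaSelector(TypedProperties props) {
       super(props);
     }
   }
   ```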







[GitHub] [hudi] nsivabalan commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r687849798



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {

Review comment:
       Can we create constants for these "Message" and "Records" (L119) strings and use the constants instead of the string literals?
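   For example, something along these lines (constant names are just a suggestion, not from the PR):
   ```java
   // Illustrative constants for the keys parsed out of the SQS message body.
   private static final String SQS_BODY_MESSAGE_KEY = "Message";   // SNS-wrapped payload
   private static final String S3_EVENT_RECORDS_KEY = "Records";   // list of S3 event records

   // ...so the parsing above becomes:
   // if (messageBody.has(SQS_BODY_MESSAGE_KEY)) { ... }
   // if (messageMap.containsKey(S3_EVENT_RECORDS_KEY)) { ... }
   ```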

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;

Review comment:
       minor. rename to isEligibleMsg

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");

Review comment:
       move to debug.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);

Review comment:
       We could just name this "eventRecords" or simply "records". We don't need to repeat "eligible" everywhere; it's implicit.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(

Review comment:
       Also, can we use valid and invalid instead of eligible and ineligible? Valid and invalid are the more commonly used terms.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}. This source will use
+ * the cloud file meta information from the cloud meta hoodie table generated by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;

Review comment:
       Can we do it in one line? 
   ```
   String filePath = "s3://" + row.getString(0) + "/" + row.getString(1);
   ```
   Just add a comment that 0 refers to the bucket and 1 refers to the key.

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);

Review comment:
       Is there a strict requirement to delete invalid messages right away rather than deleting them in onCommit()?
   We can simplify things: if processedMessages is not going to be used anywhere except for the delete during onCommit(), I would vote to return all the messages in processedMessages and keep this simple.
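   A rough sketch of that simplification, assuming the eager delete is dropped, every fetched message is added to processedMessages, and a helper like the one below (name is illustrative, not from this PR) is invoked from the source's onCommit(); it would live in the selector so the protected helpers stay visible:
   ```java
   // Hedged sketch: delete everything handed out in the last fetch (valid or not)
   // in one place. SQS caps DeleteMessageBatch at 10 entries per request, which is
   // what createListPartitions(...) is used for here.
   public void deleteProcessedMessages(AmazonSQS sqs, List<Message> processedMessages) {
     for (List<Message> batch : createListPartitions(processedMessages, 10)) {
       deleteBatchOfMessages(sqs, this.queueUrl, batch);
     }
   }
   ```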

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out ineligible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(
+              record ->
+                  Date.from(
+                          Instant.from(
+                              DateTimeFormatter.ISO_INSTANT.parse(
+                                  (String) record.get("eventTime"))))
+                      .getTime()));
+
+      List<String> filteredEventRecords = new ArrayList<>();
+      long newCheckpointTime = lastCheckpointTime;
+
+      for (Map<String, Object> eventRecord : eligibleEventRecords) {
+        newCheckpointTime =
+            Date.from(
+                    Instant.from(
+                        DateTimeFormatter.ISO_INSTANT.parse((String) eventRecord.get("eventTime"))))
+                .getTime();
+
+        // Currently HUDI don't supports column names like request-amz-id-2
+        eventRecord.remove("responseElements");
+
+        filteredEventRecords.add(
+            new ObjectMapper().writeValueAsString(eventRecord).replace("%3D", "="));
+      }
+      if (filteredEventRecords.isEmpty()) {
+        return new ImmutablePair<>(filteredEventRecords, String.valueOf(newCheckpointTime));
+      }
+      return new ImmutablePair<>(filteredEventRecords, String.valueOf(newCheckpointTime));
+    } catch (JSONException | IOException e) {
+      e.printStackTrace();

Review comment:
       Can we avoid printStackTrace, please? HoodieException will take care of it.
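
   A minimal sketch of what the catch block could look like instead (assuming the method is free to propagate a HoodieException, as the other selectors do):
   ```
   } catch (JSONException | IOException e) {
     // Rethrow with context instead of printing the stack trace; HoodieException keeps the
     // cause, so callers and logs still see the full trace.
     throw new HoodieException("Unable to read next events from queue " + queueUrl, e);
   }
   ```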

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsHoodieIncrSource.java
##########
@@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.DataSourceReadOptions;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.IncrSourceHelper;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Cloud Objects Hoodie Incr Source Class. {@link CloudObjectsHoodieIncrSource}.This source will use
+ * the cloud files meta information form cloud meta hoodie table generate by CloudObjectsMetaSource.
+ */
+public class CloudObjectsHoodieIncrSource extends HoodieIncrSource {
+
+  private static final Logger LOG = LogManager.getLogger(CloudObjectsHoodieIncrSource.class);
+
+  public CloudObjectsHoodieIncrSource(
+      TypedProperties props,
+      JavaSparkContext sparkContext,
+      SparkSession sparkSession,
+      SchemaProvider schemaProvider) {
+    super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  @Override
+  public Pair<Option<Dataset<Row>>, String> fetchNextBatch(
+      Option<String> lastCkptStr, long sourceLimit) {
+
+    DataSourceUtils.checkRequiredProperties(
+        props, Collections.singletonList(Config.HOODIE_SRC_BASE_PATH));
+
+    String srcPath = props.getString(Config.HOODIE_SRC_BASE_PATH);
+    int numInstantsPerFetch =
+        props.getInteger(Config.NUM_INSTANTS_PER_FETCH, Config.DEFAULT_NUM_INSTANTS_PER_FETCH);
+    boolean readLatestOnMissingCkpt =
+        props.getBoolean(
+            Config.READ_LATEST_INSTANT_ON_MISSING_CKPT,
+            Config.DEFAULT_READ_LATEST_INSTANT_ON_MISSING_CKPT);
+
+    // Use begin Instant if set and non-empty
+    Option<String> beginInstant =
+        lastCkptStr.isPresent()
+            ? lastCkptStr.get().isEmpty() ? Option.empty() : lastCkptStr
+            : Option.empty();
+
+    Pair<String, String> instantEndpts =
+        IncrSourceHelper.calculateBeginAndEndInstants(
+            sparkContext, srcPath, numInstantsPerFetch, beginInstant, readLatestOnMissingCkpt);
+
+    if (instantEndpts.getKey().equals(instantEndpts.getValue())) {
+      LOG.warn("Already caught up. Begin Checkpoint was :" + instantEndpts.getKey());
+      return Pair.of(Option.empty(), instantEndpts.getKey());
+    }
+
+    // Do Incr pull. Set end instant if available
+    DataFrameReader reader =
+        sparkSession
+            .read()
+            .format("org.apache.hudi")
+            .option(
+                DataSourceReadOptions.QUERY_TYPE().key(),
+                DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL())
+            .option(
+                DataSourceReadOptions.BEGIN_INSTANTTIME().key(), instantEndpts.getLeft())
+            .option(
+                DataSourceReadOptions.END_INSTANTTIME().key(), instantEndpts.getRight());
+
+    Dataset<Row> source = reader.load(srcPath);
+
+    // Extract distinct file keys from cloud meta hoodie table
+    final List<Row> cloudMetaDf =
+        source
+            .filter("s3.object.size > 0")
+            .select("s3.bucket.name", "s3.object.key")
+            .distinct()
+            .collectAsList();
+
+    // Create S3 paths
+    List<String> cloudFiles = new ArrayList<>();
+    for (Row row : cloudMetaDf) {
+      String bucket = row.getString(0);
+      String key = row.getString(1);
+      String filePath = "s3://" + bucket + "/" + key;
+      cloudFiles.add(filePath);
+    }
+    String pathStr = String.join(",", cloudFiles);

Review comment:
       Why do we join the paths here and then split them again inside fromFiles?
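
   One possible shape, as a sketch only (the fileFormat value and the exact reader wiring are assumptions, not part of this PR): keep the paths as a List<String> and hand them to the reader directly, so nothing has to be joined and re-split.
   ```
   List<String> cloudFiles = new ArrayList<>();
   for (Row row : cloudMetaDf) {
     // row schema assumed to be (s3.bucket.name, s3.object.key), as selected above
     cloudFiles.add("s3://" + row.getString(0) + "/" + row.getString(1));
   }
   // DataFrameReader.load(String... paths) accepts the paths directly
   Dataset<Row> fileDf = sparkSession.read()
       .format(fileFormat) // assumption: the source data format comes from config
       .load(cloudFiles.toArray(new String[0]));
   ```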

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queue.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWSClient for sqsClient
+   * @param queueUrl  queue full url
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");

Review comment:
       If we set numMessagesToProcess = min(approxMessagesAvailable, maxMessageEachBatch), we can avoid lines 159 to 161.
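
   A sketch of the suggested bound (the exact shape of the existing fetch loop is assumed, since lines 159 to 161 are not quoted here):
   ```
   long numMessagesToProcess = Math.min(approxMessagesAvailable, maxMessageEachBatch);
   while (messagesToProcess.size() < numMessagesToProcess) {
     List<Message> received = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
     if (received.isEmpty()) {
       // queue drained earlier than the approximate count suggested
       break;
     }
     messagesToProcess.addAll(received);
   }
   ```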

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(

Review comment:
       Lines 89 to 94 can be moved into getMessagesToProcess(...) itself. ReceiveMessageRequest is not really required in this method; it is only used within getMessagesToProcess.
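
   A sketch of that refactor, assuming getMessagesToProcess keeps its current polling loop (shown here only as a placeholder comment):
   ```
   protected List<Message> getMessagesToProcess(
       AmazonSQS sqsClient, String queueUrl, int longPollWait, int visibilityTimeout,
       int maxMessageEachBatch, int maxMessagesEachRequest) {
     // Build the request here so callers only pass plain configuration values
     ReceiveMessageRequest receiveMessageRequest =
         new ReceiveMessageRequest()
             .withQueueUrl(queueUrl)
             .withWaitTimeSeconds(longPollWait)
             .withVisibilityTimeout(visibilityTimeout)
             .withMaxNumberOfMessages(maxMessagesEachRequest);
     List<Message> messagesToProcess = new ArrayList<>();
     // ... existing polling loop from CloudObjectsSelector, unchanged ...
     return messagesToProcess;
   }
   ```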

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(

Review comment:
       Can we maintain the same terminology everywhere? The method is named "...Events", whereas the variables within are named eligible**Records**. Can we use one uniform name for all of these, either events or records?

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(
+              record ->
+                  Date.from(
+                          Instant.from(
+                              DateTimeFormatter.ISO_INSTANT.parse(
+                                  (String) record.get("eventTime"))))
+                      .getTime()));
+
+      List<String> filteredEventRecords = new ArrayList<>();
+      long newCheckpointTime = lastCheckpointTime;
+
+      for (Map<String, Object> eventRecord : eligibleEventRecords) {
+        newCheckpointTime =
+            Date.from(
+                    Instant.from(
+                        DateTimeFormatter.ISO_INSTANT.parse((String) eventRecord.get("eventTime"))))
+                .getTime();
+
+        // Currently HUDI don't supports column names like request-amz-id-2
+        eventRecord.remove("responseElements");

Review comment:
       Lines 194 and 196: can't we do this within getEligibleEvents()? Why do these manipulations in two places? Can you help me understand?
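
   As a sketch, the normalization could live where the record is accepted inside getEligibleEvents, so getNextEventsFromQueue only sorts and serializes (loop context as quoted above):
   ```
   if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
     // Hudi cannot yet handle column names like x-amz-id-2, so drop responseElements here,
     // at the single place where the record is accepted
     // (the "%3D" -> "=" cleanup from line 196 could move here too, once keys are decoded per record)
     record.remove("responseElements");
     eligibleRecords.add(record);
     isMessageDelete = Boolean.FALSE;
     processedMessages.add(message);
   } else {
     log.info("S3 event " + eventName + " is not allowed, ignoring it");
   }
   ```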

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {

Review comment:
       Can we move this entire processing of messages into its own method?
   
   So, at a high level, getEligibleEvents(..) should look like this:
   ```
   allMsgs = getAllMessages(...)
   eligibleEvents = processAndDeleteInValidMessages(...)
   return eligibleEvents
   ```
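
   In Java terms, that could look like the following sketch (getAllMessages and processAndDeleteInvalidMessages are hypothetical names, not existing methods):
   ```
   protected List<Map<String, Object>> getEligibleEvents(AmazonSQS sqs, List<Message> processedMessages)
       throws IOException {
     // fetch everything first, then validate/delete in one dedicated method
     List<Message> allMessages = getAllMessages(sqs);
     return processAndDeleteInvalidMessages(sqs, allMessages, processedMessages);
   }
   ```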

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);
+    }
+
+    return eligibleRecords;
+  }
+
+  /**
+   * Get the list of events from queue.
+   *
+   * @param sparkContext JavaSparkContext to help parallelize certain operations
+   * @param lastCheckpointStr the last checkpoint time string, empty if first run
+   * @return the list of events
+   */
+  public Pair<List<String>, String> getNextEventsFromQueue(
+      AmazonSQS sqs,
+      JavaSparkContext sparkContext,
+      Option<String> lastCheckpointStr,
+      List<Message> processedMessages) {
+
+    processedMessages.clear();
+
+    log.info("Reading messages....");
+
+    try {
+      log.info("Start Checkpoint : " + lastCheckpointStr);
+
+      long lastCheckpointTime = lastCheckpointStr.map(Long::parseLong).orElse(Long.MIN_VALUE);
+
+      List<Map<String, Object>> eligibleEventRecords = getEligibleEvents(sqs, processedMessages);
+      log.info("eligible events size: " + eligibleEventRecords.size());
+
+      // sort all events by event time.
+      eligibleEventRecords.sort(
+          Comparator.comparingLong(
+              record ->
+                  Date.from(
+                          Instant.from(
+                              DateTimeFormatter.ISO_INSTANT.parse(
+                                  (String) record.get("eventTime"))))
+                      .getTime()));
+
+      List<String> filteredEventRecords = new ArrayList<>();
+      long newCheckpointTime = lastCheckpointTime;
+
+      for (Map<String, Object> eventRecord : eligibleEventRecords) {
+        newCheckpointTime =

Review comment:
       Is it possible to set the newCheckpoint outside the for loop? It should refer to the last message, right? I guess that's the reason we sort the records earlier, isn't it?
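
   A sketch of that, assuming the list stays sorted by eventTime as above:
   ```
   long newCheckpointTime = lastCheckpointTime;
   if (!eligibleEventRecords.isEmpty()) {
     // the list is sorted by eventTime, so the last record carries the newest checkpoint
     Map<String, Object> lastRecord = eligibleEventRecords.get(eligibleEventRecords.size() - 1);
     newCheckpointTime = Instant.from(
         DateTimeFormatter.ISO_INSTANT.parse((String) lastRecord.get("eventTime"))).toEpochMilli();
   }
   ```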

##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from queue, filter out illegible events while doing so. It will also delete the
+   * ineligible messages from queue.
+   *
+   * @param processedMessages array of processed messages to add more messages
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this messages is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this messages is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");

Review comment:
       1. This should probably be a debug-level log.
   2. Do you think we need to add the message value as well?
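
   For example (a sketch; whether the raw body is safe to log is a judgment call for the author):
   ```
   } else {
     // debug level, and include the body so unexpected payloads can be diagnosed
     log.debug("Message is not in the expected format, or it is an s3:TestEvent: " + message.getBody());
   }
   ```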




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 33f7d78265f1a9635d6254e0dbfb40f161a3d4a7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688207653



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/CloudObjectsMetaSource.java
##########
@@ -0,0 +1,88 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.hudi.utilities.sources.helpers.CloudObjectsMetaSelector;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Encoders;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * This source provides capability to create the hoodie table for cloudObject Metadata (eg. s3

Review comment:
       "hoodie cloud meta table" sounds like hoodie as cloud provider (or provisioned by hoodie). Instead, "hoodie table for cloud object metadata" sounds more clear. Wdyt?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688480052



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelector.java
##########
@@ -0,0 +1,285 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.TypedProperties;
+
+import com.amazonaws.regions.Regions;
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
+import com.amazonaws.services.sqs.model.BatchResultErrorEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequest;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchRequestEntry;
+import com.amazonaws.services.sqs.model.DeleteMessageBatchResult;
+import com.amazonaws.services.sqs.model.GetQueueAttributesRequest;
+import com.amazonaws.services.sqs.model.GetQueueAttributesResult;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.json.JSONObject;
+
+import java.io.UnsupportedEncodingException;
+import java.net.URLDecoder;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * Cloud Objects Selector Class. This class has methods for processing cloud objects. It currently
+ * supports only AWS S3 objects and AWS SQS queues.
+ */
+public class CloudObjectsSelector {
+  public static final List<String> ALLOWED_S3_EVENT_PREFIX =
+      Collections.singletonList("ObjectCreated");
+  public static volatile Logger log = LogManager.getLogger(CloudObjectsSelector.class);
+  public final String queueUrl;
+  public final int longPollWait;
+  public final int maxMessagesEachRequest;
+  public final int maxMessageEachBatch;
+  public final int visibilityTimeout;
+  public final TypedProperties props;
+  public final String fsName;
+  private final String regionName;
+
+  /**
+   * Cloud Objects Selector Class. {@link CloudObjectsSelector}
+   */
+  public CloudObjectsSelector(TypedProperties props) {
+    DataSourceUtils.checkRequiredProperties(props, Arrays.asList(Config.QUEUE_URL_PROP, Config.QUEUE_REGION));
+    this.props = props;
+    this.queueUrl = props.getString(Config.QUEUE_URL_PROP);
+    this.regionName = props.getString(Config.QUEUE_REGION);
+    this.fsName = props.getString(Config.SOURCE_QUEUE_FS_PROP, "s3").toLowerCase();
+    this.longPollWait = props.getInteger(Config.QUEUE_LONGPOLLWAIT_PROP, 20);
+    this.maxMessageEachBatch = props.getInteger(Config.QUEUE_MAXMESSAGESEACHBATCH_PROP, 5);
+    this.visibilityTimeout = props.getInteger(Config.QUEUE_VISIBILITYTIMEOUT_PROP, 30);
+    this.maxMessagesEachRequest = 10;
+  }
+
+  /**
+   * Get SQS queue attributes.
+   *
+   * @param sqsClient AWS SQS client
+   * @param queueUrl  full URL of the SQS queue
+   * @return map of attributes needed
+   */
+  protected Map<String, String> getSqsQueueAttributes(AmazonSQS sqsClient, String queueUrl) {
+    GetQueueAttributesResult queueAttributesResult =
+        sqsClient.getQueueAttributes(
+            new GetQueueAttributesRequest(queueUrl)
+                .withAttributeNames("ApproximateNumberOfMessages"));
+    return queueAttributesResult.getAttributes();
+  }
+
+  /**
+   * Get the file attributes filePath, eventTime and size from JSONObject record.
+   *
+   * @param record of object event
+   * @return map of file attribute
+   */
+  protected Map<String, Object> getFileAttributesFromRecord(JSONObject record)
+      throws UnsupportedEncodingException {
+
+    Map<String, Object> fileRecord = new HashMap<>();
+    String eventTimeStr = record.getString("eventTime");
+    long eventTime =
+        Date.from(Instant.from(DateTimeFormatter.ISO_INSTANT.parse(eventTimeStr))).getTime();
+
+    JSONObject s3Object = record.getJSONObject("s3").getJSONObject("object");
+    String bucket =
+        URLDecoder.decode(
+            record.getJSONObject("s3").getJSONObject("bucket").getString("name"), "UTF-8");
+    String key = URLDecoder.decode(s3Object.getString("key"), "UTF-8");
+    String filePath = this.fsName + "://" + bucket + "/" + key;
+
+    fileRecord.put("eventTime", eventTime);
+    fileRecord.put("fileSize", s3Object.getLong("size"));
+    fileRecord.put("filePath", filePath);
+    return fileRecord;
+  }
+
+  /**
+   * Amazon SQS Client Builder.
+   */
+  public AmazonSQS createAmazonSqsClient() {
+    return AmazonSQSClientBuilder.standard().withRegion(Regions.fromName(regionName)).build();
+  }
+
+  /**
+   * List messages from queue.
+   */
+  protected List<Message> getMessagesToProcess(
+      AmazonSQS sqsClient,
+      String queueUrl,
+      ReceiveMessageRequest receiveMessageRequest,
+      int maxMessageEachBatch,
+      int maxMessagesEachRequest) {
+    List<Message> messagesToProcess = new ArrayList<>();
+
+    // Get count for available messages
+    Map<String, String> queueAttributesResult = getSqsQueueAttributes(sqsClient, queueUrl);
+    long approxMessagesAvailable =
+        Long.parseLong(queueAttributesResult.get("ApproximateNumberOfMessages"));
+    log.info("Approx. " + approxMessagesAvailable + " messages available in queue.");

Review comment:
       Done. But we will still need to check `messages.isEmpty()` and break out of the loop, because the value of `ApproximateNumberOfMessages` returned by SQS is only eventually consistent. So, if it reports a positive count while there are actually no messages left, we don't want to run the loop again.
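       For context, here is a minimal sketch of the loop being described, assuming it lives inside `CloudObjectsSelector` next to the `getMessagesToProcess` signature shown in the diff above. The field names and the reading of `maxMessageEachBatch` as a per-sync message cap are assumptions for illustration, not the PR's exact implementation:

       ```java
       // Sketch only: poll SQS until the approximate count or the batch cap is reached,
       // breaking early when a receive returns no messages.
       protected List<Message> getMessagesToProcess(
           AmazonSQS sqsClient,
           String queueUrl,
           ReceiveMessageRequest receiveMessageRequest,
           int maxMessageEachBatch,
           int maxMessagesEachRequest) {
         List<Message> messagesToProcess = new ArrayList<>();
         // maxMessagesEachRequest is already applied by the caller via
         // receiveMessageRequest.setMaxNumberOfMessages(...)
         Map<String, String> attrs = getSqsQueueAttributes(sqsClient, queueUrl);
         long approxMessagesAvailable = Long.parseLong(attrs.get("ApproximateNumberOfMessages"));
         long fetched = 0;
         while (fetched < approxMessagesAvailable && fetched < maxMessageEachBatch) {
           List<Message> messages = sqsClient.receiveMessage(receiveMessageRequest).getMessages();
           if (messages.isEmpty()) {
             // ApproximateNumberOfMessages is only eventually consistent; an empty receive
             // means the queue is drained for now, so stop polling instead of looping again.
             break;
           }
           messagesToProcess.addAll(messages);
           fetched += messages.size();
         }
         return messagesToProcess;
       }
       ```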







[GitHub] [hudi] hudi-bot edited a comment on pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
hudi-bot edited a comment on pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#issuecomment-895012652


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1572",
       "triggerID" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "076123328724c1ef5051208c57706ae09ba6c11e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1590",
       "triggerID" : "895681502",
       "triggerType" : "MANUAL"
     }, {
       "hash" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1607",
       "triggerID" : "33f7d78265f1a9635d6254e0dbfb40f161a3d4a7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1712",
       "triggerID" : "bd9b7dd1e3f0a17fca7fab59650f6f0b03873dc1",
       "triggerType" : "PUSH"
     }, {
       "hash" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719",
       "triggerID" : "ee8fbced2d229bd487794a19123c47417acbf306",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5ffdc82b5e08f83772f28c2bf844688bc3e9fc50",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * ee8fbced2d229bd487794a19123c47417acbf306 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=1719) 
   * 5ffdc82b5e08f83772f28c2bf844688bc3e9fc50 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run travis` re-run the last Travis build
    - `@hudi-bot run azure` re-run the last Azure build
   </details>





[GitHub] [hudi] codope commented on a change in pull request #3433: [HUDI-1897] Deltastreamer source for AWS S3

Posted by GitBox <gi...@apache.org>.
codope commented on a change in pull request #3433:
URL: https://github.com/apache/hudi/pull/3433#discussion_r688588692



##########
File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsMetaSelector.java
##########
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.helpers;
+
+import com.amazonaws.services.sqs.AmazonSQS;
+import com.amazonaws.services.sqs.model.Message;
+import com.amazonaws.services.sqs.model.ReceiveMessageRequest;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import java.io.IOException;
+import java.time.Instant;
+import java.time.format.DateTimeFormatter;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.Date;
+import java.util.List;
+import java.util.Map;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.collection.ImmutablePair;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.json.JSONException;
+import org.json.JSONObject;
+
+/**
+ * Cloud Objects Meta Selector Class. This class will provide the methods to process the messages
+ * from queue for CloudObjectsMetaSource.
+ */
+public class CloudObjectsMetaSelector extends CloudObjectsSelector {
+
+  /** Cloud Objects Meta Selector Class. {@link CloudObjectsSelector} */
+  public CloudObjectsMetaSelector(TypedProperties props) {
+    super(props);
+  }
+
+  /**
+   * Factory method for creating custom CloudObjectsMetaSelector. Default selector to use is {@link
+   * CloudObjectsMetaSelector}
+   */
+  public static CloudObjectsMetaSelector createSourceSelector(TypedProperties props) {
+    String sourceSelectorClass =
+        props.getString(
+            CloudObjectsMetaSelector.Config.SOURCE_INPUT_SELECTOR,
+            CloudObjectsMetaSelector.class.getName());
+    try {
+      CloudObjectsMetaSelector selector =
+          (CloudObjectsMetaSelector)
+              ReflectionUtils.loadClass(
+                  sourceSelectorClass, new Class<?>[] {TypedProperties.class}, props);
+
+      log.info("Using path selector " + selector.getClass().getName());
+      return selector;
+    } catch (Exception e) {
+      throw new HoodieException("Could not load source selector class " + sourceSelectorClass, e);
+    }
+  }
+
+  /**
+   * List messages from the queue, filtering out ineligible events while doing so. It will also
+   * delete the ineligible messages from the queue.
+   *
+   * @param processedMessages list of already processed messages, to which newly processed messages are added
+   * @return the list of eligible records
+   */
+  protected List<Map<String, Object>> getEligibleEvents(
+      AmazonSQS sqs, List<Message> processedMessages) throws IOException {
+
+    List<Map<String, Object>> eligibleRecords = new ArrayList<>();
+    List<Message> ineligibleMessages = new ArrayList<>();
+
+    ReceiveMessageRequest receiveMessageRequest =
+        new ReceiveMessageRequest()
+            .withQueueUrl(this.queueUrl)
+            .withWaitTimeSeconds(this.longPollWait)
+            .withVisibilityTimeout(this.visibilityTimeout);
+    receiveMessageRequest.setMaxNumberOfMessages(this.maxMessagesEachRequest);
+
+    List<Message> messages =
+        getMessagesToProcess(
+            sqs,
+            this.queueUrl,
+            receiveMessageRequest,
+            this.maxMessageEachBatch,
+            this.maxMessagesEachRequest);
+
+    for (Message message : messages) {
+      boolean isMessageDelete = Boolean.TRUE;
+
+      JSONObject messageBody = new JSONObject(message.getBody());
+      Map<String, Object> messageMap;
+      ObjectMapper mapper = new ObjectMapper();
+
+      if (messageBody.has("Message")) {
+        // If this message is from S3Event -> SNS -> SQS
+        messageMap =
+            (Map<String, Object>) mapper.readValue(messageBody.getString("Message"), Map.class);
+      } else {
+        // If this message is from S3Event -> SQS
+        messageMap = (Map<String, Object>) mapper.readValue(messageBody.toString(), Map.class);
+      }
+      if (messageMap.containsKey("Records")) {
+        List<Map<String, Object>> records = (List<Map<String, Object>>) messageMap.get("Records");
+        for (Map<String, Object> record : records) {
+          String eventName = (String) record.get("eventName");
+
+          // filter only allowed s3 event types
+          if (ALLOWED_S3_EVENT_PREFIX.stream().anyMatch(eventName::startsWith)) {
+            eligibleRecords.add(record);
+            isMessageDelete = Boolean.FALSE;
+            processedMessages.add(message);
+
+          } else {
+            log.info("This S3 event " + eventName + " is not allowed, so ignoring it.");
+          }
+        }
+      } else {
+        log.info("Message is not expected format or it's s3:TestEvent");
+      }
+      if (isMessageDelete) {
+        ineligibleMessages.add(message);
+      }
+    }
+    if (!ineligibleMessages.isEmpty()) {
+      deleteBatchOfMessages(sqs, queueUrl, ineligibleMessages);

Review comment:
       Agreed. All messages, valid or invalid, will be deleted onCommit.
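       To make that concrete, a hedged sketch of what the onCommit cleanup could look like, assuming the `deleteBatchOfMessages(sqs, queueUrl, messages)` helper referenced in the diff above and that the selector keeps the processed messages around until the commit succeeds (the method name and wiring here are assumptions, not the PR's exact code):

       ```java
       // Sketch only: delete every message that was folded into the commit, valid or invalid,
       // chunking into groups of 10 because a DeleteMessageBatchRequest accepts at most 10 entries.
       public void onCommitDeleteProcessedMessages(
           AmazonSQS sqs, String queueUrl, List<Message> processedMessages) {
         final int sqsBatchLimit = 10;
         for (int start = 0; start < processedMessages.size(); start += sqsBatchLimit) {
           List<Message> batch = processedMessages.subList(
               start, Math.min(start + sqsBatchLimit, processedMessages.size()));
           deleteBatchOfMessages(sqs, queueUrl, batch);
         }
       }
       ```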



