You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/04/25 13:42:47 UTC

[GitHub] [flink] empcl opened a new pull request, #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

empcl opened a new pull request, #19573:
URL: https://github.com/apache/flink/pull/19573

   ## What is the purpose of the change
   This pull request solve the problem that the latest data cannot be read under the createtime configuration in the hive dimension table.
   
   ## Brief change log
   solve the problem that the latest data cannot be read under the createtime configuration in the hive dimension table
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   no
   
   This change added tests and can be verified as follows:
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] empcl commented on a diff in pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
empcl commented on code in PR #19573:
URL: https://github.com/apache/flink/pull/19573#discussion_r859451083


##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/read/HivePartitionFetcherContextBase.java:
##########
@@ -139,16 +138,9 @@ public List<ComparablePartitionValue> getComparablePartitionValueList() throws E
                                 tablePath.getDatabaseName(),
                                 tablePath.getObjectName(),
                                 Short.MAX_VALUE);
-                List<String> newNames =
-                        partitionNames.stream()
-                                .filter(
-                                        n ->
-                                                !partValuesToCreateTime.containsKey(
-                                                        extractPartitionValues(n)))
-                                .collect(Collectors.toList());
                 List<Partition> newPartitions =
                         metaStoreClient.getPartitionsByNames(
-                                tablePath.getDatabaseName(), tablePath.getObjectName(), newNames);
+                                tablePath.getDatabaseName(), tablePath.getObjectName(), partitionNames);

Review Comment:
   OK, I'll look at your reply tonight. thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] luoyuxia commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
luoyuxia commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1110455056

   cc @leonardBang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] empcl commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
empcl commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1112448067

   @luoyuxia Regarding adding test cases, I'm sorry, I can't verify this change on HivePartitionFetcherTest. Because, the test case above HivePartitionFetcherTest cannot actually obtain the specific partition information (that is why `assertEquals(0, fetcherContext.getComparablePartitionValueList().size());`). In addition, because the aspect of my modification is to operate on a known partition during the running of the program. In terms of current test mocks, it doesn't meet my needs. So, I'm sorry for not being able to provide relevant test cases to meet my needs. However, I have verified the correctness of the code on a distributed cluster.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] luoyuxia commented on a diff in pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
luoyuxia commented on code in PR #19573:
URL: https://github.com/apache/flink/pull/19573#discussion_r859313505


##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/read/HivePartitionFetcherContextBase.java:
##########
@@ -139,16 +138,9 @@ public List<ComparablePartitionValue> getComparablePartitionValueList() throws E
                                 tablePath.getDatabaseName(),
                                 tablePath.getObjectName(),
                                 Short.MAX_VALUE);
-                List<String> newNames =
-                        partitionNames.stream()
-                                .filter(
-                                        n ->
-                                                !partValuesToCreateTime.containsKey(
-                                                        extractPartitionValues(n)))
-                                .collect(Collectors.toList());
                 List<Partition> newPartitions =
                         metaStoreClient.getPartitionsByNames(
-                                tablePath.getDatabaseName(), tablePath.getObjectName(), newNames);
+                                tablePath.getDatabaseName(), tablePath.getObjectName(), partitionNames);

Review Comment:
   Then it will always contain the all paritions, finally return all partitions with create time, right?
   I prefer to only return the partitions that dose changes. Maybe you can compare the modifcation time for all partitions to decide which partition is updated and should be returned.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] luoyuxia commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
luoyuxia commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1110455604

   @empcl The compiler failed. You can use `mvn spotless:apply` to fix the failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1108603599

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "d0d36d0299573abdb21157e70a8f9d81cf66cbda",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "d0d36d0299573abdb21157e70a8f9d81cf66cbda",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * d0d36d0299573abdb21157e70a8f9d81cf66cbda UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] empcl commented on a diff in pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
empcl commented on code in PR #19573:
URL: https://github.com/apache/flink/pull/19573#discussion_r861119613


##########
flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/read/HivePartitionFetcherContextBase.java:
##########
@@ -139,16 +138,9 @@ public List<ComparablePartitionValue> getComparablePartitionValueList() throws E
                                 tablePath.getDatabaseName(),
                                 tablePath.getObjectName(),
                                 Short.MAX_VALUE);
-                List<String> newNames =
-                        partitionNames.stream()
-                                .filter(
-                                        n ->
-                                                !partValuesToCreateTime.containsKey(
-                                                        extractPartitionValues(n)))
-                                .collect(Collectors.toList());
                 List<Partition> newPartitions =
                         metaStoreClient.getPartitionsByNames(
-                                tablePath.getDatabaseName(), tablePath.getObjectName(), newNames);
+                                tablePath.getDatabaseName(), tablePath.getObjectName(), partitionNames);

Review Comment:
   @luoyuxia Hi, this is a great idea. However, this method getComparablePartitionValueList() returns all partitions in the current directory. The comparison and selection of the required partition information is done in the outer layer of the method. In addition, this method does not recommend directly returning the required partition information. Because in the case of bounded flow, all partition information is obtained by this method.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] empcl commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
empcl commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1160616245

   @leonardBang hi, could you help me review & merge?Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] luoyuxia commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
luoyuxia commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1118078947

   @leonardBang Cound you please help review & merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] leonardBang commented on pull request #19573: [FLINK-27384] solve the problem that the latest data cannot be read under the creat…

Posted by GitBox <gi...@apache.org>.
leonardBang commented on PR #19573:
URL: https://github.com/apache/flink/pull/19573#issuecomment-1178607449

   @empcl The change looks good to me, however could you also open PR for `master` and `release-1.15` branch, usually we should fix master first and then backport to other branches. Btw, could you rebase this PR to `release-1.14` as the latest CI has been passed some days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org