Posted to commits@seatunnel.apache.org by GitBox <gi...@apache.org> on 2022/09/09 16:31:14 UTC

[GitHub] [incubator-seatunnel] TyrantLucifer opened a new pull request, #2708: [Improve][Connector-V2] Refactor hive source & sink connector

TyrantLucifer opened a new pull request, #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708

   ## Purpose of this pull request
   
   Building on #2555, this PR further optimizes the Hive sink connector.
   
   ## Check list
   
   * [x] Code changed are covered with tests, or it does not need tests for reason:
   * [ ] If any new Jar binary package adding in your PR, please add License Notice according
     [New License Guide](https://github.com/apache/incubator-seatunnel/blob/dev/docs/en/contribution/new-license.md)
   * [x] If necessary, please update the documentation to describe the new feature. https://github.com/apache/incubator-seatunnel/tree/dev/docs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@seatunnel.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969567365


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   I read the code again. If we read all the data of a split and send it downstream within the `pollNext` method, we can support `exactly-once`, because the `pollNext` and `snapshot` methods use the same lock, `checkpointLock`.
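
The argument in this comment can be sketched as follows. This is only an illustration, not the connector's code: `WholeSplitReaderSketch`, `addSplit`, and `snapshotState` are hypothetical names, splits are modelled as plain lists of strings, and the lock is held as a private field for simplicity. Because `pollNext` emits a whole split while holding `checkpointLock`, a snapshot taken under the same lock can never observe a half-read split, so the state only needs the splits that have not been emitted yet.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical reader that emits one whole split per pollNext call. */
final class WholeSplitReaderSketch {
    // Assumption: the same lock guards both emitting and snapshotting.
    private final Object checkpointLock = new Object();
    private final Deque<List<String>> pendingSplits = new ArrayDeque<>();

    void addSplit(List<String> split) {
        synchronized (checkpointLock) {
            pendingSplits.addLast(split);
        }
    }

    /** Emits every record of the next split without releasing the lock in between. */
    void pollNext(List<String> downstream) {
        synchronized (checkpointLock) {
            List<String> split = pendingSplits.pollFirst();
            if (split != null) {
                downstream.addAll(split);
            }
        }
    }

    /** The state is just the splits not yet emitted; no per-record offset is needed. */
    List<List<String>> snapshotState() {
        synchronized (checkpointLock) {
            return new ArrayList<>(pendingSplits);
        }
    }
}
```

On recovery, re-adding the snapshotted splits to a fresh reader would replay only data that was never sent downstream before the checkpoint.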





[GitHub] [incubator-seatunnel] EricJoy2048 merged pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 merged PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708




[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969564012


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   From the perspective of code implementation, yes.





[GitHub] [incubator-seatunnel] CalvinKirs commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
CalvinKirs commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r970278440


##########
seatunnel-connectors-v2/connector-hive/src/main/java/org/apache/seatunnel/connectors/seatunnel/hive/utils/HiveMetaStoreProxy.java:
##########
@@ -38,15 +46,36 @@ public HiveMetaStoreProxy(@NonNull String uris) {
         }
     }
 
+    public static synchronized HiveMetaStoreProxy getInstance(Config config) {
+        if (INSTANCE == null) {
+            String metastoreUri = config.getString(HiveConfig.METASTORE_URI);
+            INSTANCE = new HiveMetaStoreProxy(metastoreUri);
+        }
+        return INSTANCE;
+    }

Review Comment:
   We can lock only during initialization:
   ```
   if instance != null
      then return instance
      else lock and init
   ```
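
One common way to realize the suggestion above is double-checked locking. The following is a minimal sketch under stated assumptions, not the code that was merged: `MetaStoreProxySketch` is a hypothetical stand-in for `HiveMetaStoreProxy`, and it takes the metastore URI as a plain string instead of a `Config`. The field must be `volatile` so that other threads never see a partially constructed instance.

```java
import java.util.Objects;

/** Hypothetical stand-in for HiveMetaStoreProxy; it only stores the URI. */
final class MetaStoreProxySketch {
    // volatile makes the fully constructed instance visible to all threads.
    private static volatile MetaStoreProxySketch instance;

    private final String metastoreUri;

    private MetaStoreProxySketch(String metastoreUri) {
        this.metastoreUri = Objects.requireNonNull(metastoreUri, "metastoreUri");
    }

    static MetaStoreProxySketch getInstance(String metastoreUri) {
        MetaStoreProxySketch local = instance; // fast path: no lock once initialized
        if (local == null) {
            synchronized (MetaStoreProxySketch.class) {
                local = instance; // re-check: another thread may have initialized it
                if (local == null) {
                    local = new MetaStoreProxySketch(metastoreUri);
                    instance = local;
                }
            }
        }
        return local;
    }

    String getMetastoreUri() {
        return metastoreUri;
    }
}
```

Compared with a `synchronized` method, callers only contend on the lock until the first instance exists. Note that, as in the reviewed code, arguments passed on later calls are ignored once the instance has been created.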





[GitHub] [incubator-seatunnel] EricJoy2048 commented on pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#issuecomment-1246184196

   @CalvinKirs  PTAL




[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969564012


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   From the perspective of the code implementation, it is not supported at present.





[GitHub] [incubator-seatunnel] TyrantLucifer commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
TyrantLucifer commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969259029


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   So file source does not support `exactly-once`, right?





[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969201716


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   If a source connector supports `exactly-once`, it must save the split and the offset already sent downstream when taking a snapshot.
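
The requirement stated in this comment can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SplitStateSketch` and `OffsetTrackingReaderSketch` are hypothetical classes, a split is modelled as an in-memory list, and the offset is a record index. The reader advances the offset under a lock as it emits, and the snapshot captures the split ID together with that offset so a restored reader can resume without re-emitting or skipping records.

```java
import java.util.List;

/** Hypothetical checkpoint state: the split being read and the next record to emit. */
final class SplitStateSketch {
    final String splitId;
    final int nextOffset;

    SplitStateSketch(String splitId, int nextOffset) {
        this.splitId = splitId;
        this.nextOffset = nextOffset;
    }
}

/** Hypothetical reader that emits one record per call and tracks its offset. */
final class OffsetTrackingReaderSketch {
    private final Object checkpointLock = new Object();
    private final String splitId;
    private final List<String> records;
    private int nextOffset;

    OffsetTrackingReaderSketch(String splitId, List<String> records, int startOffset) {
        this.splitId = splitId;
        this.records = records;
        this.nextOffset = startOffset;
    }

    /** Emits the next record, if any, and advances the offset atomically with the emit. */
    boolean pollNext(List<String> downstream) {
        synchronized (checkpointLock) {
            if (nextOffset >= records.size()) {
                return false;
            }
            downstream.add(records.get(nextOffset));
            nextOffset++;
            return true;
        }
    }

    /** Captures the split and the offset of everything already sent downstream. */
    SplitStateSketch snapshotState() {
        synchronized (checkpointLock) {
            return new SplitStateSketch(splitId, nextOffset);
        }
    }
}
```

After a failure, constructing a new reader from the saved `splitId` and `nextOffset` resumes at the first record that was not covered by the checkpoint.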





[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969567365


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   I read the code again. If we read all the data of a split and send it downstream in a single call to the `pollNext` method, it can support `exactly-once`, because the `pollNext` and `snapshot` methods use the same lock, `checkpointLock`.





[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969567365


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   I read the code again. If we read all the data of a split and send it downstream in the `pollNext` method, we can support `exactly-once`, because the `pollNext` and `snapshot` methods use the same lock, `checkpointLock`.



##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   I read the code again. If we read all the data of a split and send them downstream in the `pollNext` method, we can support `exactly-once`, because the `pollNext` and `snapshot` methods use the same lock, `checkpointLock`.





[GitHub] [incubator-seatunnel] EricJoy2048 commented on a diff in pull request #2708: [Improve][Connector-V2] Refactor hive source & sink connector

Posted by GitBox <gi...@apache.org>.
EricJoy2048 commented on code in PR #2708:
URL: https://github.com/apache/incubator-seatunnel/pull/2708#discussion_r969567365


##########
docs/en/connector-v2/source/Hive.md:
##########
@@ -0,0 +1,47 @@
+# Hive
+
+> Hive source connector
+
+## Description
+
+Read data from Hive.
+
+In order to use this connector, You must ensure your spark/flink cluster already integrated hive. The tested hive version is 2.3.9.
+
+## Key features
+
+- [x] [exactly-once](../../concept/connector-v2-features.md)
+
+By default, we use 2PC commit to ensure `exactly-once`

Review Comment:
   > So file source does not support `exactly-once`, right?
   
   I read the code again. If we read all the data of a split and send it downstream, we can support `exactly-once`, because the `pollNext` and `snapshot` methods use the same lock, `checkpointLock`.


