You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by "myfjdthink (via GitHub)" <gi...@apache.org> on 2023/04/13 01:55:49 UTC

[GitHub] [doris] myfjdthink opened a new pull request, #18616: support read iceberg data on gcs

myfjdthink opened a new pull request, #18616:
URL: https://github.com/apache/doris/pull/18616

   # Proposed changes
   
   Issue Number: close #10452
   
   ## Problem summary
   
   Describe your changes.
   
   ## Checklist(Required)
   
   - [ ] Does it affect the original behavior
   - [ ] Has unit tests been added
   - [ ] Has document been added or modified
   - [ ] Does it need to update dependencies
   - [x] Is this PR support rollback (If NO, please explain WHY)
   
   ## Further comments
   
   我修改了 `org.apache.doris.planner.external.IcebergSplitter#getSplits` 方法,在 Iceberg 读取 metadata 获取到 data file path 后将 gs:// 前缀修改为 s3:// 前缀,这样 BE 才能顺利读取文件,但是我不知道为什么 aliyun oss 不需要这样做,这里存疑
   
   
   I changed the `org.apache.doris.planner.external.IcebergSplitter#getSplits` method changes the gs://  prefix to s3:// prefix after Iceberg reads the metadata to get the data file path, so BE can read the file smoothly, but I don't know why Aliyun oss doesn't need to do this, which is questionable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on a diff in pull request #18616: Support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1187192116


##########
fe/fe-core/src/main/java/org/apache/doris/planner/external/iceberg/IcebergScanNode.java:
##########
@@ -185,7 +186,10 @@ public List<Split> getSplits() throws UserException {
             long fileSize = task.file().fileSizeInBytes();
             for (FileScanTask splitTask : task.split(splitSize)) {
                 String dataFilePath = splitTask.file().path().toString();
-                IcebergSplit split = new IcebergSplit(new Path(dataFilePath), splitTask.start(),
+                // BE 无法识别 gs:// 开头的 file path,需要转换为 s3
+                // 为什么其他对象存储不需要转换? 例如 oss

Review Comment:
   到BE那边都转换成s3://了的,我觉得这个注释可以删掉



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink closed pull request #18616: Support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink closed pull request #18616: Support read iceberg data on gcs
URL: https://github.com/apache/doris/pull/18616


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on pull request #18616: support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1522873174

   > Please hold and wait for #18960 fixed CC @wsjz
   
   I will debug it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] morningman commented on a diff in pull request #18616: Support read iceberg data on gcs

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1186649165


##########
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java:
##########
@@ -0,0 +1,64 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.common.util;
+
+import org.apache.doris.common.FeConstants;
+import org.apache.doris.thrift.TFileType;
+
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+public class S3Util {
+    public static final Logger LOG = LogManager.getLogger(S3Util.class);
+
+    public static boolean isS3Path(String location) {

Review Comment:
   private



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on a diff in pull request #18616: Support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1187189573


##########
fe/fe-core/src/main/java/org/apache/doris/common/util/S3Util.java:
##########
@@ -0,0 +1,64 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.common.util;
+
+import org.apache.doris.common.FeConstants;
+import org.apache.doris.thrift.TFileType;
+
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+public class S3Util {
+    public static final Logger LOG = LogManager.getLogger(S3Util.class);
+
+    public static boolean isS3Path(String location) {

Review Comment:
   we can use isObjStorage instead of isS3Path. so maybe need remove it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on a diff in pull request #18616: support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1177419516


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   I think the reason of that  'gs://' is required is because of the gcs-connector?  I had the same problem when I used Huawei OBSClient before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] CalvinKirs commented on a diff in pull request #18616: support read iceberg data on gcs

Posted by "CalvinKirs (via GitHub)" <gi...@apache.org>.
CalvinKirs commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1177452208


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   > The latest version of the `gcs-connector` conflicts with the `grpc-netty-shaded`. You can upgrade to the latest version if you know how to resolve the conflict↳
   
   If you can tease out their dependencies, and list version compatibility, then we can consider which version is more appropriate.
   But one thing that needs to be explained, is the consideration of CVE is always the first and most important point.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] morningman commented on pull request #18616: support read iceberg data on gcs

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1522700324

   Please hold and wait for #18960 fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink commented on a diff in pull request #18616: Support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1196175952


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   gcs-connector upgrade to the latest version,  set <exclusion>*</exclusion> to avoid conflict problems



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink commented on pull request #18616: Support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1537924299

   Get same error as this issuehttps://github.com/apache/doris/issues/19275
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on pull request #18616: Support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1537984242

   run buildall


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] morningman commented on a diff in pull request #18616: Support read iceberg data on gcs

Posted by "morningman (via GitHub)" <gi...@apache.org>.
morningman commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1196717843


##########
docs/en/docs/lakehouse/multi-catalog/iceberg.md:
##########
@@ -129,6 +129,23 @@ If you want to use S3 storage, the following properties need to be set.
 "AWS_CREDENTIALS_PROVIDER" = "provider-class-name" // Optional. The default credentials class is based on BasicAWSCredentials.
 ```
 
+#### Google Dataproc Metastore 作为元数据服务
+
+```sql
+CREATE CATALOG iceberg PROPERTIES (
+"type"="iceberg",

Review Comment:
   Add indent , eg, 4 spaces for all properties



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink commented on pull request #18616: support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1507899787

   > should be LogManager.getLogger(S3Util.class)
   
   fixed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink commented on pull request #18616: support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink commented on PR #18616:
URL: https://github.com/apache/doris/pull/18616#issuecomment-1507899338

   在 master 分支上跑通,测试可以读取 gcs iceberg 数据,我后面再补充一个文档的 PR,描述何如配置 gcs iceberg


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] CalvinKirs commented on a diff in pull request #18616: support read iceberg data on gcs

Posted by "CalvinKirs (via GitHub)" <gi...@apache.org>.
CalvinKirs commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1171073957


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   Considering that this jar was released in 2020, there are some CVEs of dependent components, have you considered upgrading the version?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] myfjdthink commented on a diff in pull request #18616: support read iceberg data on gcs

Posted by "myfjdthink (via GitHub)" <gi...@apache.org>.
myfjdthink commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1174860303


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   The latest version of the `gcs-connector` conflicts with the `grpc-netty-shaded`. You can upgrade to the latest version if you know how to resolve the conflict



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org


[GitHub] [doris] wsjz commented on a diff in pull request #18616: support read iceberg data on gcs

Posted by "wsjz (via GitHub)" <gi...@apache.org>.
wsjz commented on code in PR #18616:
URL: https://github.com/apache/doris/pull/18616#discussion_r1177419516


##########
fe/fe-core/pom.xml:
##########
@@ -746,6 +746,22 @@ under the License.
             </exclusions>
         </dependency>
 
+        <!-- for fe recognize files stored on gcs -->
+        <dependency>
+            <groupId>com.google.cloud.bigdataoss</groupId>
+            <artifactId>gcs-connector</artifactId>
+            <version>hadoop3-2.1.4</version>
+            <classifier>shaded</classifier>
+            <!-- conflict with grpc-netty-shaded -->

Review Comment:
   I think the reason of that  'gs://' is required maybe you use the gcs-connector?  I had the same problem when I used Huawei OBSClient before.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org