Posted to commits@doris.apache.org by GitBox <gi...@apache.org> on 2022/06/28 07:51:40 UTC

[GitHub] [doris] myfjdthink commented on issue #10452: [Feature] Doris support read iceberg table on google cloud storage

myfjdthink commented on issue #10452:
URL: https://github.com/apache/doris/issues/10452#issuecomment-1168365071

   In addition to the authorization issues, Doris is also unable to read Iceberg data stored on GCS.
   Here is an example.
   I wrote an Iceberg table with Spark, with the data stored on GCS,
   and then read it in the Hive environment, where it works fine:
   
   `hive> select * from gs_table2;`
   ```
   Query ID = nick_20220628041137_6fe75ec9-bae7-40b2-8d96-f6653fcdfb49
   Total jobs = 1
   Launching Job 1 out of 1
   Status: Running (Executing on YARN cluster with App id application_1656302766976_0013)
   
   ----------------------------------------------------------------------------------------------
           VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
   ----------------------------------------------------------------------------------------------
   Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
   ----------------------------------------------------------------------------------------------
   VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 8.34 s
   ----------------------------------------------------------------------------------------------
   OK
   1	a
   2	b
   3	c
   4	a
   1	a
   2	b
   3	c
   4	a
   Time taken: 9.569 seconds, Fetched: 8 row(s)
   ```
   
   `hive> describe formatted gs_table2;`
   ```
   OK
   # col_name            	data_type           	comment
   id                  	int                 	from deserializer
   data                	string              	from deserializer
   
   # Detailed Table Information
   Database:           	gsdb
   OwnerType:          	USER
   Owner:              	nick
   CreateTime:         	Tue Jun 28 02:34:36 UTC 2022
   LastAccessTime:     	Sun Dec 14 22:38:21 UTC 1969
   Retention:          	2147483647
   Location:           	gs://iceberg-spark/datasets/gsdb.db/gs_table2
   Table Type:         	EXTERNAL_TABLE
   Table Parameters:
   	EXTERNAL            	TRUE
   	metadata_location   	gs://iceberg-spark/datasets/gsdb.db/gs_table2/metadata/00002-833dafb8-cda5-4c7c-a2ea-04e2c84fa372.metadata.json
   	numFiles            	8
   	numRows             	8
   	owner               	nick
   	previous_metadata_location	gs://iceberg-spark/datasets/gsdb.db/gs_table2/metadata/00001-0c51975f-0d0a-4c2e-a5d1-aad3251c0393.metadata.json
   	storage_handler     	org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
   	table_type          	ICEBERG
   	totalSize           	4952
   	transient_lastDdlTime	1656383676
   	uuid                	86645efc-8a03-47f0-a397-1ec30877b1bb
   
   # Storage Information
   SerDe Library:      	org.apache.iceberg.mr.hive.HiveIcebergSerDe
   InputFormat:        	org.apache.iceberg.mr.hive.HiveIcebergInputFormat
   OutputFormat:       	org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
   Compressed:         	No
   Num Buckets:        	0
   Bucket Columns:     	[]
   Sort Columns:       	[]
   Time taken: 0.206 seconds, Fetched: 34 row(s)
   ```
   
   Then I tried to create the Iceberg table in Doris:
   ```sql
   CREATE TABLE `gs_table2` 
   ENGINE = ICEBERG
   PROPERTIES (
   "iceberg.database" = "gsdb",
   "iceberg.table" = "gs_table2",
   "iceberg.hive.metastore.uris" = "thrift://10.201.0.104:9083",
   "iceberg.catalog.type"  =  "HIVE_CATALOG"
   );
   ```
   The SQL execution timed out and the table creation failed.
   
   
   
   Here is another example.
   I wrote an Iceberg table with Spark, this time with the data stored in HDFS:
   `hive> select * from test_table;`
   ```
   Query ID = nick_20220628042917_7209ca76-1704-45e1-8678-88af4297b64a
   Total jobs = 1
   Launching Job 1 out of 1
   Tez session was closed. Reopening...
   Session re-established.
   Session re-established.
   Status: Running (Executing on YARN cluster with App id application_1656302766976_0014)
   
   ----------------------------------------------------------------------------------------------
           VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
   ----------------------------------------------------------------------------------------------
   Map 1 .......... container     SUCCEEDED      1          1        0        0       0       0
   ----------------------------------------------------------------------------------------------
   VERTICES: 01/01  [==========================>>] 100%  ELAPSED TIME: 7.40 s
   ----------------------------------------------------------------------------------------------
   OK
   1	a
   2	b
   3	c
   Time taken: 14.78 seconds, Fetched: 3 row(s)
   ```
   
   `hive> describe formatted test_table;`
   
   ```
   OK
   # col_name            	data_type           	comment
   id                  	bigint              	from deserializer
   data                	string              	from deserializer
   
   # Detailed Table Information
   Database:           	testdb
   OwnerType:          	USER
   Owner:              	nick
   CreateTime:         	Thu Jun 23 11:11:56 UTC 2022
   LastAccessTime:     	Wed Dec 10 07:15:41 UTC 1969
   Retention:          	2147483647
   Location:           	hdfs://10.201.0.104:8020/user/hive/warehouse/testdb.db/test_table
   Table Type:         	EXTERNAL_TABLE
   Table Parameters:
   	EXTERNAL            	TRUE
   	metadata_location   	hdfs://10.201.0.104:8020/user/hive/warehouse/testdb.db/test_table/metadata/00001-ceb6024b-3d7d-4304-8fff-f2aad293d2cf.metadata.json
   	numFiles            	3
   	numRows             	3
   	owner               	nick
   	previous_metadata_location	hdfs://10.201.0.104:8020/user/hive/warehouse/testdb.db/test_table/metadata/00000-f95f2c72-1e64-421b-a4b9-bccb77984d32.metadata.json
   	storage_handler     	org.apache.iceberg.mr.hive.HiveIcebergStorageHandler
   	table_type          	ICEBERG
   	totalSize           	1929
   	transient_lastDdlTime	1655982716
   	uuid                	b3a2408e-bc56-4486-bd81-65b2be22b2f3
   
   # Storage Information
   SerDe Library:      	org.apache.iceberg.mr.hive.HiveIcebergSerDe
   InputFormat:        	org.apache.iceberg.mr.hive.HiveIcebergInputFormat
   OutputFormat:       	org.apache.iceberg.mr.hive.HiveIcebergOutputFormat
   Compressed:         	No
   Num Buckets:        	0
   Bucket Columns:     	[]
   Sort Columns:       	[]
   Time taken: 0.144 seconds, Fetched: 34 row(s)
   ```
   
   I then tried to create the Iceberg table in Doris. It was created successfully and the data could be read:
   ```sql
   CREATE TABLE `test_table` 
   ENGINE = ICEBERG
   PROPERTIES (
   "iceberg.database" = "testdb",
   "iceberg.table" = "test_table",
   "iceberg.hive.metastore.uris" = "thrift://10.201.0.104:9083",
   "iceberg.catalog.type"  =  "HIVE_CATALOG"
   );
   
   
   select * from iceberg_db.test_table;
   ```
   Query result:
   
   data|id|
   ----+--+
   a   | 1|
   b   | 2|
   c   | 3|
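
To make the contrast between the two cases explicit, here is a minimal sketch that extracts the URI scheme from each table's `Location` value (copied from the `describe formatted` outputs above) and pairs it with the outcome observed in Doris. The helper names `storage_scheme` and `summarize` are hypothetical and used only for illustration. The sketch only restates the observations; it does not diagnose why the GCS-backed table times out.

```python
from urllib.parse import urlparse

# Location values copied from the two `describe formatted` outputs above.
TABLE_LOCATIONS = {
    "gsdb.gs_table2": "gs://iceberg-spark/datasets/gsdb.db/gs_table2",
    "testdb.test_table": "hdfs://10.201.0.104:8020/user/hive/warehouse/testdb.db/test_table",
}

# Outcome of CREATE TABLE ... ENGINE = ICEBERG in Doris, as reported above.
DORIS_RESULT = {
    "gsdb.gs_table2": "timeout",
    "testdb.test_table": "ok",
}


def storage_scheme(location: str) -> str:
    """Return the URI scheme (for example 'gs' or 'hdfs') of a table location."""
    return urlparse(location).scheme


def summarize() -> dict:
    """Map each table name to (storage scheme, observed Doris outcome)."""
    return {
        name: (storage_scheme(location), DORIS_RESULT[name])
        for name, location in TABLE_LOCATIONS.items()
    }


if __name__ == "__main__":
    for name, (scheme, result) in summarize().items():
        print(f"{name}: scheme={scheme}, doris={result}")
```

Both tables use the same metastore URI and the same Iceberg storage handler. The visible difference between the failing and the working case is the storage scheme of the table location.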


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

