Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/11 07:22:54 UTC

[GitHub] [hudi] BruceKellan opened a new issue, #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

BruceKellan opened a new issue, #7643:
URL: https://github.com/apache/hudi/issues/7643

   **Describe the problem you faced**
   
   We are testing the Hudi connector on a copy-on-write table with Trino 405 (the latest stable version), but we ran into a serious performance problem. 
   Our tables will have a very large number of partitions, so we built a minimal test set to reproduce the issue.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   test table data:
   [hudi_reproduce.tar.gz](https://github.com/apache/hudi/files/10389539/hudi_reproduce.tar.gz)
   desc: 
   This test table has many partitions and is partitioned by day and type. There are 657 data items in total.
   
   <img width="342" alt="image" src="https://user-images.githubusercontent.com/13477122/211738768-09bc6156-bb49-4e0f-82ba-5c058220ee89.png">
   <img width="690" alt="image" src="https://user-images.githubusercontent.com/13477122/211742605-050b65b0-d09b-4618-90a5-98b5ebd3f8a1.png">
   
   
   1. Import the data and run the following HiveQL to create the table and repair its partitions.
   ```sql
   CREATE EXTERNAL TABLE `website.hudi_reproduce`(
   `_hoodie_commit_time` string,
   `_hoodie_commit_seqno` string,
   `_hoodie_record_key` string,
   `_hoodie_partition_path` string,
   `_hoodie_file_name` string,
   `uniquekey` string)
   PARTITIONED BY (
   `day` bigint,
   `type` bigint)
   ROW FORMAT SERDE
   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   WITH SERDEPROPERTIES (
   'hoodie.query.as.ro.table'='false',
   'path'='hdfs://xxx/hudi/warehouse/hudi_reproduce')
   STORED AS INPUTFORMAT
   'org.apache.hudi.hadoop.HoodieParquetInputFormat'
   OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
   'hdfs://xxx/hudi/warehouse/hudi_reproduce'
   TBLPROPERTIES (
   'last_commit_time_sync'='20230111113655773',
   'last_modified_time'='1673406649',
   'spark.sql.sources.provider'='hudi',
   'spark.sql.sources.schema.numPartCols'='2',
   'spark.sql.sources.schema.numParts'='1',
   'spark.sql.sources.schema.part.0'='{"type":"struct","fields":[{"name":"_hoodie_commit_time","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_commit_seqno","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_record_key","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_partition_path","type":"string","nullable":true,"metadata":{}},{"name":"_hoodie_file_name","type":"string","nullable":true,"metadata":{}},{"name":"uniqueKey","type":"string","nullable":true,"metadata":{}},{"name":"day","type":"long","nullable":true,"metadata":{}},{"name":"type","type":"long","nullable":true,"metadata":{}}]}',
   'spark.sql.sources.schema.partCol.0'='day',
   'spark.sql.sources.schema.partCol.1'='type',
   'transient_lastDdlTime'='1673406649');
   
   -- repair partitions.
   msck repair table website.hudi_reproduce;
   ```
   
   2. Run the following Trino SQL query:
   ```sql
   -- We want the rows where type is between 1 and 9 and day is between 20230101 and 20230104.
   select count(1) from hudi.website.hudi_reproduce where day between 20230101 and 20230104 and type between 1 and 9;
   ```
   
   3. The query is too slow:
   <img width="1261" alt="image" src="https://user-images.githubusercontent.com/13477122/211741572-b58cbc3f-50c5-4fb0-8ad7-83cf53a12e81.png">
   
   **Expected behavior**
   
   The query should run as fast as it does against a Hive table.
   
   **Environment Description**
   
   * Hudi version : 0.12.1
   
   * Hive version : 2.3.9
   
   * Hadoop version : 2.8.5
   
   * Trino version: 405
   
   * Number of trino worker: 8
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   We are sharing our Trino server.log in the hope that it helps.
   [hudi_reproduce_trino_server_log.log](https://github.com/apache/hudi/files/10389658/hudi_reproduce_trino_server_log.log)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1382598461

   @alexeykudinkin 
   
   > Is it Hive?
   
   No, it's Trino.
   
   > can you elaborate what you're measuring this performance against?
   
   The amount of data is very small. With the same data in a Hive table, each query takes only 2 seconds, but against the Hudi table it takes 15 seconds, which confuses me.
   
   > You're using Hudi's Hive connector in Trino, right?
   
   No, I'm using Trino's embedded hudi connector, not the hive connector.
   




[GitHub] [hudi] alexeykudinkin commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1385743502

   @BruceKellan thanks for the detailed context! This is very helpful
   
   cc @yihua 




[GitHub] [hudi] codope commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1419155910

   @BruceKellan I have a working patch with significant performance gains. On your table, I could see a 50-60% latency reduction. https://github.com/codope/trino/pull/23
   Can you try the above patch? Let me know if you have trouble building it, and I can share the trino-server tarball with you.
   I need to make a few minor changes before I can raise a PR against the Trino repo, but early feedback from you would be helpful.




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1382726802

   BTW, when running the query through trino-hive-connector with hudi-hadoop-mr-bundle.jar, it only takes 2.x seconds.




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1381426977

   The entire query took 15 seconds, and the step of fetching the partitions took 12 of them. This does not seem to be the expected behavior.




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1381583382

   > The regression should be a blocker for release 0.13.0, have created a JIRA: https://issues.apache.org/jira/browse/HUDI-5552
   
   Thanks, Danny! I have provided more detailed reproduction steps and a minimal data set. If you need anything else, please ping me.
   I read the RFC for the trino-hudi-connector, but I still find this partition-fetching behavior very confusing.




[GitHub] [hudi] BruceKellan closed issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by "BruceKellan (via GitHub)" <gi...@apache.org>.
BruceKellan closed issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.
URL: https://github.com/apache/hudi/issues/7643




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1378333370

   @codope Can you help me?




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1380282920

   > @BruceKellan we addressed performance regression in 0.12.2. Can you give it a try?
   
   <img width="502" alt="image" src="https://user-images.githubusercontent.com/13477122/212068968-d535849a-8ad0-4034-8769-c0decba4c293.png">
   
   I upgraded the Hudi version in trino-hudi to 0.12.2 and recompiled, but the query is still too slow. We tried to locate the problem by enabling debug-level logging and got some information.
   
   We ran this Trino SQL query:
   ```sql
   select count(1) from hudi.website.hudi_reproduce where day between 20230101 and 20230104 and type between 1 and 9;
   ```
   
   The Hudi connector still fetches all partitions from the Hive metastore one by one. Maybe this is the reason.
   <img width="1534" alt="image" src="https://user-images.githubusercontent.com/13477122/212069245-c4c3903c-b85c-4174-acd1-1f0bdab2c977.png">
   [server.log](https://github.com/apache/hudi/files/10401595/server.log)
   @alexeykudinkin @codope WDYT?
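
Editorial sketch: to make the suspected cost concrete, the following Python snippet counts round trips to a stand-in metastore. It is not Trino or Hudi code. `FakeMetastore`, its methods, the `day=.../type=...` partition naming, and the 200-partition layout are all assumptions made for illustration; the real partition layout and metastore API may differ. It contrasts fetching every partition with one call each against filtering partition names by the query predicate first and then fetching the survivors in a single batched call.

```python
from dataclasses import dataclass


@dataclass
class FakeMetastore:
    """Hypothetical stand-in for a metastore; it only counts round trips."""
    partition_names: list
    calls: int = 0

    def list_partition_names(self):
        self.calls += 1
        return list(self.partition_names)

    def get_partition(self, name):
        self.calls += 1
        return {"name": name}

    def get_partitions_by_names(self, names):
        self.calls += 1
        return [{"name": n} for n in names]


def parse(name):
    """Turn 'day=20230101/type=3' into {'day': 20230101, 'type': 3}."""
    return {k: int(v) for k, v in (part.split("=") for part in name.split("/"))}


def matches(name):
    """Predicate from the query: day in [20230101, 20230104], type in [1, 9]."""
    values = parse(name)
    return 20230101 <= values["day"] <= 20230104 and 1 <= values["type"] <= 9


def fetch_one_by_one(store):
    """Fetch every partition individually, then filter: one call per partition."""
    names = store.list_partition_names()
    fetched = [store.get_partition(n) for n in names]
    return [p for p in fetched if matches(p["name"])]


def fetch_pruned_batch(store):
    """Filter names first, then fetch only the survivors in one batched call."""
    names = [n for n in store.list_partition_names() if matches(n)]
    return store.get_partitions_by_names(names)


# Made-up layout (an assumption): 10 days x 20 types = 200 partitions.
ALL_NAMES = [f"day={20230101 + d}/type={t}" for d in range(10) for t in range(1, 21)]

slow = FakeMetastore(ALL_NAMES)
fast = FakeMetastore(ALL_NAMES)
slow_result = fetch_one_by_one(slow)
fast_result = fetch_pruned_batch(fast)
print(slow.calls, fast.calls, len(fast_result))  # prints: 201 2 36
```

Both strategies return the same 36 partitions in this toy setup; only the number of round trips differs. Whether the connector can actually prune this way depends on what the metastore client exposes, which this sketch does not model.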
   
   
   
   




[GitHub] [hudi] danny0405 commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1381538288

   The regression should be a blocker for release 0.13.0, have created a JIRA: https://issues.apache.org/jira/browse/HUDI-5552




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by "BruceKellan (via GitHub)" <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1420122870

   @codope OK, I will try it this week.




[GitHub] [hudi] danny0405 commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1378346776

   I vaguely remember there being some issues with Trino, where the partition/file lists are queried many times. @codope may be able to give more details here.




[GitHub] [hudi] alexeykudinkin commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1382212563

   @BruceKellan can you elaborate what you're measuring this performance against? Is it Hive?
   
   You're using Hudi's Hive connector in Trino, right? 




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1379730867

   @alexeykudinkin Maybe I didn't understand correctly. Do you mean I should use the latest master of Trino?




[GitHub] [hudi] alexeykudinkin commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1379294059

   @BruceKellan we addressed performance regression in 0.12.2. Can you give it a try?




[GitHub] [hudi] BruceKellan commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by "BruceKellan (via GitHub)" <gi...@apache.org>.
BruceKellan commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1404423564

   Thank you for your work! I am also looking forward to seeing a powerful Hudi connector.
   Also, I have an additional question: do you think the "get all partitions" calls appearing in the logs are normal behavior?




[GitHub] [hudi] codope commented on issue #7643: [SUPPORT] Too slow while using trino-hudi connector while querying partitioned tables.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on issue #7643:
URL: https://github.com/apache/hudi/issues/7643#issuecomment-1403217907

   @BruceKellan Sorry, I was away on a break and didn't get a chance to look into this.
   First of all, I don't think it's a regression, as the query is slow even with the Hudi connector of Trino version 400 using Hudi 0.11.1. There was a regression in the Hive connector due to a change in Hudi code, and we have fixed that in [master](https://github.com/apache/hudi/commit/a882f440d37b4adb0ff194dad579c11dc44bbc78).
   
   Now, in my setup of the Hudi connector, I found that the query is slow because there is a single split manager thread doing all the listing. It's also evident in your setup (`hudi-split-manager-0`). This is quite inefficient. I need to improve this so that it works more like the Hive connector's background split loader. This is a change in the Trino codebase and not in Hudi. I will work on it next week.
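
Editorial sketch: the difference between one thread doing all the listing and a pool of threads sharing it can be illustrated in Python as below. This is not the Trino split manager or the Hive connector's background split loader. `list_partition_files`, the 10 ms simulated latency, the partition names, and the pool size of 8 are hypothetical values chosen only for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def list_partition_files(partition):
    """Hypothetical stand-in for one file-system listing call."""
    time.sleep(0.01)  # simulated I/O latency, not a measured value
    return [f"{partition}/file-0.parquet"]


def list_sequential(partitions):
    """One thread lists every partition in turn."""
    files = []
    for partition in partitions:
        files.extend(list_partition_files(partition))
    return files


def list_parallel(partitions, workers=8):
    """A pool of threads lists partitions concurrently; map keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = pool.map(list_partition_files, partitions)
        return [f for files in per_partition for f in files]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Made-up partition names (an assumption about the layout).
partitions = [f"day=20230101/type={t}" for t in range(1, 41)]
seq_files, seq_seconds = timed(list_sequential, partitions)
par_files, par_seconds = timed(list_parallel, partitions)
print(f"sequential: {seq_seconds:.2f}s, parallel: {par_seconds:.2f}s")
```

Because the simulated work is I/O-bound waiting, the pooled version overlaps the waits and finishes sooner while returning the same files in the same order. Real gains depend on how the file system and metastore behave under concurrent requests.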

