Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/15 12:21:01 UTC

[GitHub] [hudi] tooptoop4 opened a new issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

tooptoop4 opened a new issue #1833:
URL: https://github.com/apache/hudi/issues/1833


   I have a single 700MB file containing 10 million rows (all keys are unique, the key is a single column, and every row belongs to the same partition).
   
   1. Create a brand new table.
   2. Use the Spark datasource to write the 700MB file in COW insert mode, with all rows going to a single partition. This takes 7.3 minutes (I thought it could be faster, but it's acceptable). After it succeeds, 88 parquet files end up under the table path on S3 (total size 288MB).
   3. Repeat the exact same spark-submit from step 2, but in COW upsert mode, against the same single partition with the exact same 700MB file. This was still running after more than 2 hours!
   
   
   --total-executor-cores 16 --driver-memory 2G --executor-memory 16G --executor-cores 2
   dynamic allocation of executors is on, shuffle service is on
   spark 2.4.6
   hudi 0.5.3
   
   "spark.sql.shuffle.partitions" is "16"
   "hoodie.insert.shuffle.parallelism" is "16"
   "hoodie.upsert.shuffle.parallelism" is "16"
   private static final String DEFAULT_INLINE_COMPACT = "true";
   private static final String DEFAULT_CLEANER_FILE_VERSIONS_RETAINED = "1";
   private static final String DEFAULT_CLEANER_COMMITS_RETAINED = "1";
   private static final String DEFAULT_MAX_COMMITS_TO_KEEP = "3";
   private static final String DEFAULT_MIN_COMMITS_TO_KEEP = "2";
   private static final String DEFAULT_EMBEDDED_TIMELINE_SERVER_ENABLED = "false";
   private static final String DEFAULT_FAIL_ON_TIMELINE_ARCHIVING_ENABLED = "false";
   rest of configs are default
   
   ![image](https://user-images.githubusercontent.com/33283496/87543811-42235b80-c69d-11ea-94fe-452e681e8f55.png)
   
   ![image](https://user-images.githubusercontent.com/33283496/87543859-5c5d3980-c69d-11ea-94c9-0717d49b0276.png)
   
   ![image](https://user-images.githubusercontent.com/33283496/87543302-6599d680-c69c-11ea-9244-b073d89c58d9.png)
   
   ![image](https://user-images.githubusercontent.com/33283496/87543362-86622c00-c69c-11ea-9d0a-e71ae6f74f13.png)
   
   ![image](https://user-images.githubusercontent.com/33283496/87543472-bdd0d880-c69c-11ea-90c8-cb44b6d17d1c.png)
   
   ![image](https://user-images.githubusercontent.com/33283496/87543555-de992e00-c69c-11ea-9ae3-db5c510f30f1.png)
   
   
   The shuffle size seems to be extremely high! Any idea how to speed this up? How long does a 100% update take for you, i.e. running the same 10 million row / 700MB file twice against a new table?
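
For reference, the write in steps 2 and 3 could be configured roughly as in the sketch below. This is a minimal illustration under stated assumptions, not the actual job: the table name, record key column, partition column, precombine column and the S3 path in the trailing comment are hypothetical placeholders. Only the two shuffle parallelism values come from the settings listed above.

```python
def hudi_write_options(operation, table_name="my_table", record_key="id",
                       partition_field="part", precombine_field="ts",
                       parallelism=16):
    """Build a dict of Hudi datasource options for an insert or upsert write.

    All default names (table, key, partition, precombine) are hypothetical.
    """
    if operation not in ("insert", "upsert"):
        raise ValueError("operation must be 'insert' or 'upsert'")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Parallelism values as reported in this issue.
        "hoodie.insert.shuffle.parallelism": str(parallelism),
        "hoodie.upsert.shuffle.parallelism": str(parallelism),
    }


insert_opts = hudi_write_options("insert")  # step 2
upsert_opts = hudi_write_options("upsert")  # step 3

# With PySpark and the Hudi bundle on the classpath, the options would be
# applied roughly like this (path is a placeholder):
# df.write.format("org.apache.hudi").options(**upsert_opts) \
#     .mode("append").save("s3://some-bucket/some-table")
```

The only difference between the two runs is the value of `hoodie.datasource.write.operation`, which is what makes the gap in runtime notable.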
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar closed issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1833:
URL: https://github.com/apache/hudi/issues/1833


   





[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-660415532


   @tooptoop4 : Pinging again to see if you can get us the information for the two runs, with and without the bucketized bloom index.
   





[GitHub] [hudi] bvaradar edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659193833


   It looks like hoodie.bloom.index.bucketized.checking has the same setting in 0.4.6, so there is no need to check that. Can you try comparing the other configs related to index lookup between 0.4.6 and 0.5.3 to see if any defaults have changed? I will check back in a day or so.





[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659193833


   It looks like this is the same setting in 0.4.6. Can you try comparing the other configs related to index lookup between 0.4.6 and 0.5.3 to see if any defaults have changed? I will check back in a day or so.





[GitHub] [hudi] tooptoop4 edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659327491


   Actually, I was using a 0.4.6-SNAPSHOT build from before the bucketized.checking code landed. I changed hoodie.bloom.index.bucketized.checking to false on Hudi 0.5.3 and the time went down to 107 minutes :)
   
   Hudi 0.5.3 in local mode with hoodie.bloom.index.bucketized.checking set to false takes 122 minutes.
   
   
   





[GitHub] [hudi] tooptoop4 commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659655270


   ![image](https://user-images.githubusercontent.com/33283496/87719610-8735b380-c7ab-11ea-9eee-d55ef3d7fd36.png)
   
   @bvaradar this image shows the tasks for the "flatMapToPair at HoodieBloomIndex.java:308" stage. How can the number of records and the shuffle size be so much larger than both my input file and the existing table?





[GitHub] [hudi] tooptoop4 edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-660715533


   @bvaradar I noticed a "There is insufficient memory for the Java Runtime Environment to continue." error, so I reduced SPARK_WORKER_MEMORY (i.e. to leave more room for OS memory). The timings I get now are: 43 minutes with hoodie.bloom.index.bucketized.checking = false and 59 minutes with hoodie.bloom.index.bucketized.checking = true.
   
   **hoodie.bloom.index.bucketized.checking = false**
   
   ![image](https://user-images.githubusercontent.com/33283496/87885750-b4829b80-ca0f-11ea-99d9-195b3a6cc562.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885776-dda32c00-ca0f-11ea-9f8e-e9c15ead96c2.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885794-fad7fa80-ca0f-11ea-8d16-b5a290676525.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885812-1e9b4080-ca10-11ea-9ac7-e3a487f4a8b7.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885847-5dc99180-ca10-11ea-9a13-fbef57f240b3.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885876-91a4b700-ca10-11ea-906b-563cd0d25d55.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885894-bb5dde00-ca10-11ea-977a-681a3c7b4d1c.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885907-d6305280-ca10-11ea-8f2d-aeec67b1916b.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885922-f19b5d80-ca10-11ea-8359-5fc0adecb8cb.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885930-07a91e00-ca11-11ea-84f5-379f1953ad67.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885947-1e4f7500-ca11-11ea-81cb-977a289eba53.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885961-4343e800-ca11-11ea-9f7d-bea8d5a47012.png)
   ![image](https://user-images.githubusercontent.com/33283496/87885972-5eaef300-ca11-11ea-82a2-3dcc70474d5c.png)
   
   
   
   
   **hoodie.bloom.index.bucketized.checking = true**
   
   
   
   ![image](https://user-images.githubusercontent.com/33283496/87886008-a03f9e00-ca11-11ea-9a23-acccedbcae29.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886021-bd746c80-ca11-11ea-986f-ce83b8430869.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886046-e85ec080-ca11-11ea-99d0-52fe4d7bdc2d.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886069-09271600-ca12-11ea-8bab-e06ccb503e80.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886091-2bb92f00-ca12-11ea-9d00-561ef63bcabf.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886110-4be8ee00-ca12-11ea-9eb3-d17de793bb9b.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886117-63c07200-ca12-11ea-97fa-7655500c3848.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886131-79ce3280-ca12-11ea-898b-bbaca156fd91.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886152-95393d80-ca12-11ea-8b90-c6f6c52bff94.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886164-ac782b00-ca12-11ea-8231-e147ad4376b5.png)
   ![image](https://user-images.githubusercontent.com/33283496/87886171-bc900a80-ca12-11ea-9ef6-a7b680d2943a.png)
   
   
   I wonder if https://issues.apache.org/jira/browse/SPARK-27734 is causing the memory issues.





[GitHub] [hudi] bvaradar edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659194433


   Also wondering if we are indeed comparing apples to apples. As the runs are happening in different setups, can you help localize the problem by running both 0.4.6 and 0.5.3 in the same setup and providing Spark DAG UI screenshots (jobs, stages and tasks, to check for skew)?





[GitHub] [hudi] tooptoop4 commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659327491


   Actually, I was using a 0.4.6-SNAPSHOT build from before the bucketized.checking code landed. I changed hoodie.bloom.index.bucketized.checking to false on Hudi 0.5.3 and the time went down to 107 minutes :) I will now try Hudi 0.5.3 in local mode.





[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659174736


   Can you try setting the config "hoodie.bloom.index.bucketized.checking" to false and rerunning? Kindly report back with your observations.
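
As a rough illustration of the suggestion above, the flag can be layered onto whatever write options the job already passes. This is only a sketch: the helper name and the sample `base_opts` are made up for illustration, and only the config key itself comes from this thread.

```python
def with_bucketized_checking(options, enabled):
    """Return a copy of the write options with bucketized bloom checking set."""
    updated = dict(options)  # copy so the caller's dict is left untouched
    updated["hoodie.bloom.index.bucketized.checking"] = "true" if enabled else "false"
    return updated


# Hypothetical existing options for the upsert job.
base_opts = {"hoodie.upsert.shuffle.parallelism": "16"}
opts_off = with_bucketized_checking(base_opts, enabled=False)
opts_on = with_bucketized_checking(base_opts, enabled=True)
```

Building both variants from the same base options keeps the two runs identical apart from the one flag being compared.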








[GitHub] [hudi] tooptoop4 edited a comment on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
tooptoop4 edited a comment on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659327491


   Actually, I was using a 0.4.6-SNAPSHOT build from before the bucketized.checking code landed. I changed hoodie.bloom.index.bucketized.checking to false on Hudi 0.5.3 and the time went down to 107 minutes :)
   
   Hudi 0.5.3 in local mode takes 122 minutes.
   
   
   





[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659194433


   Also wondering if we are indeed comparing apples to apples. Are both runs (0.4.6 vs 0.5.3) happening in the same setup?





[GitHub] [hudi] bvaradar commented on issue #1833: [SUPPORT] 100% update on 10mn keys in single partition slow

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1833:
URL: https://github.com/apache/hudi/issues/1833#issuecomment-659892874


   
   @tooptoop4 : Can you provide us the Spark DAGs with timings (job, stage and task level) for 0.5.3 with the bucketized bloom index on and for 0.5.3 with it off? We need to see why you are seeing such a massive performance difference.
   
   Regarding your question, please take a look at the comment in https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndex.java#L249 
   This is basically an exploded RDD of record keys paired with the files they need to be compared against.
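
To make the "exploded RDD" idea concrete, here is a rough sketch of the concept, not Hudi's actual implementation: each incoming record key is paired with every file whose min/max key range could contain it, so the number of pairs can approach keys × files when the ranges overlap. The function name and the sample ranges are made up for illustration. The figures of 10 million keys and 88 files come from the issue description, and assuming that every file's range covers every key is a worst case.

```python
def explode_key_file_pairs(keys, file_ranges):
    """Pair each key with every file whose [min_key, max_key] range may contain it."""
    pairs = []
    for key in keys:
        for file_id, (lo, hi) in file_ranges.items():
            if lo <= key <= hi:
                pairs.append((file_id, key))
    return pairs


# Small example: three files with overlapping key ranges (hypothetical values).
file_ranges = {"f1": ("a", "m"), "f2": ("c", "z"), "f3": ("n", "z")}
pairs = explode_key_file_pairs(["b", "d", "p"], file_ranges)
# "b" fits only f1, "d" fits f1 and f2, "p" fits f2 and f3: 5 pairs for 3 keys.

# Worst case for the scenario in this issue: all 88 files have key ranges that
# cover all 10 million keys, so every key is checked against every file.
worst_case_pairs = 10_000_000 * 88
```

Under that worst-case assumption the exploded set holds 880 million (file, key) pairs, which would explain a record count and shuffle size far larger than either the input file or the existing table.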
   
   Thanks,
   Balaji.V
   

