Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/07 16:58:26 UTC

[GitHub] [incubator-hudi] vinothchandar commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-610504319
 
 
   Is this real data, or can you share a reproducible snippet of code? With local microbenchmarks especially, it's useful to understand the setup, since small costs that typically don't matter on a real cluster tend to get amplified.
   
   From the logs, it seems like:
   1) bulk_insert is succeeding and upsert is what's failing... and it's failing during the write phase, when we actually allocate some memory to do the merge.
   
   
   2) From the logs below, it seems like you potentially have a lot of data for a single node. How much total data is in those 53M records? (Total data size is a key metric for runtime, more so than the number of records; Hudi does not have a maximum record limit per se.)
   
   ```
   20/04/07 08:02:55 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory map of 1325.4 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory map of 1329.9 MB to disk (1 time so far)
   20/04/07 08:03:04 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory map of 1325.7 MB to disk (1 time so far)
   20/04/07 08:03:07 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory map of 1385.6 MB to disk (1 time so far)
   20/04/07 08:03:25 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:41 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory map of 1325.5 MB to disk (2 times so far)
   20/04/07 08:03:43 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory map of 1325.4 MB to disk (2 times so far)
   20/04/07 08:03:58 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory map of 1381.4 MB to disk (2 times so far)
   20/04/07 08:04:08 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:24 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory map of 1325.4 MB to disk (3 times so far)
   20/04/07 08:04:28 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory map of 1327.7 MB to disk (3 times so far)
   20/04/07 08:04:57 INFO ExternalAppendOnlyMap: Thread 136 spilling in-memory map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:04:59 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory map of 1491.8 MB to disk (3 times so far)
   20/04/07 08:05:14 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory map of 1363.9 MB to disk (4 times so far)
   20/04/07 08:05:16 INFO ExternalAppendOnlyMap: Thread 135 spilling in-memory map of 1325.4 MB to disk (4 times so far)
   20/04/07 08:05:47 INFO ExternalAppendOnlyMap: Thread 47 spilling in-memory map of 1349.8 MB to disk (4 times so far)
   20/04/07 08:06:05 INFO ExternalAppendOnlyMap: Thread 137 spilling in-memory map of 1300.9 MB to disk (5 times so far)
   ```
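
   Just to illustrate why total volume matters more than row count, here is a back-of-the-envelope sketch; the 500-bytes-per-record figure is purely a hypothetical placeholder, not something taken from your workload:

   ```python
   # Rough estimate of total raw data volume for an upsert batch.
   def estimate_total_bytes(num_records, avg_record_bytes):
       """Total bytes = record count x average serialized record size."""
       return num_records * avg_record_bytes

   # 53M records at a hypothetical 500 bytes/record:
   total = estimate_total_bytes(53_000_000, 500)
   print(f"{total / 1024**3:.1f} GiB")  # ~24.7 GiB
   ```

   At that scale the cached input alone can dwarf a single node's heap, which is consistent with the repeated spills in the log above.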
   
   I suspect what's happening is that Spark memory is actually full (Hudi caches the input to derive the workload profile etc., so it's typically advised to keep the input data in memory) and it keeps spilling to disk, slowing everything down (more of a Spark tuning issue). But things don't break until Hudi tries to allocate some memory of its own, at which point the heap is full.
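   
   If you do want to experiment with the memory knobs before moving to a cluster, a minimal sketch would look something like the following. All paths and sizing values here are illustrative placeholders, not recommendations for your workload, and need tuning against your actual data volume:

   ```python
   # Illustrative Spark/Hudi memory tuning sketch -- all values hypothetical.
   from pyspark import StorageLevel
   from pyspark.sql import SparkSession

   spark = (
       SparkSession.builder
       .appName("hudi-upsert-tuning")
       # Give executors enough heap that the cached input plus Hudi's own
       # merge buffers fit without constant spilling (value is illustrative).
       .config("spark.executor.memory", "8g")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .getOrCreate()
   )

   df = spark.read.parquet("/path/to/input")  # hypothetical input path
   # Hudi caches the input to derive the workload profile; persisting with
   # disk overflow avoids recomputation while tolerating a too-small heap.
   df.persist(StorageLevel.MEMORY_AND_DISK)

   (df.write.format("hudi")
      .option("hoodie.table.name", "my_table")  # hypothetical table name
      .option("hoodie.datasource.write.operation", "upsert")
      # Cap the memory Hudi's merge may allocate (bytes; value illustrative).
      .option("hoodie.memory.merge.max.size", str(1024 * 1024 * 1024))
      .mode("append")
      .save("/path/to/hudi/table"))  # hypothetical base path
   ```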
   
   Can you give this a shot on a cluster?
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services