Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/03 01:20:06 UTC

[GitHub] [incubator-hudi] cdmikechen opened a new issue #1481: [SUPPORT] If using a Spark session to deal with many tables, the Hudi cache may report `java.lang.OutOfMemoryError`

URL: https://github.com/apache/incubator-hudi/issues/1481
 
 
   **Describe the problem you faced**
   When one Spark session is used to process many tables, Hudi's cache may eventually cause a `java.lang.OutOfMemoryError`.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   In my Hudi production environment, Hudi processes 4000+ tables every day, always using the COW table type.
   For each table, we use JDBC to pull the data from the original table (an RDBMS) and compare it with the existing Hudi data to identify the records that need to be merged. Finally, we use Hudi's upsert operation to update the existing data.
   Because the scheduled processing time differs from table to table, we keep the Spark session open at all times so that a processing task can be executed immediately. In addition, we use Spring Boot to start the Spark session so that it can also respond to some REST-based requests.
   The program had no problems and ran normally for many days, but then `java.lang.OutOfMemoryError` began to appear. I checked the logs and found messages suggesting that some of Hudi's caches might have caused the `java.lang.OutOfMemoryError`.
   So I was wondering: should I start a timeline service independently, or clean up Hudi's cached data for the tables that have already been processed?
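   
   For context, the per-table flow above looks roughly like this (a minimal PySpark sketch, not our exact code; the JDBC URL, table names, key/precombine fields, and paths are all hypothetical, and Hudi's upsert itself performs the merge against existing data):
   
   ```python
   # Sketch only: assumes a long-lived SparkSession `spark` with the Hudi bundle
   # on the classpath (Hudi 0.5.x). Connection details below are hypothetical.
   src_df = (spark.read.format("jdbc")
       .option("url", "jdbc:postgresql://db-host:5432/source_db")  # hypothetical
       .option("dbtable", "public.orders")                          # hypothetical
       .option("user", "etl")
       .option("password", "...")
       .load())
   
   # Upsert into the COW table; Hudi matches rows on the record key and keeps
   # the latest version per key according to the precombine field.
   (src_df.write.format("org.apache.hudi")
       .option("hoodie.table.name", "orders")
       .option("hoodie.datasource.write.operation", "upsert")
       .option("hoodie.datasource.write.recordkey.field", "order_id")    # hypothetical
       .option("hoodie.datasource.write.precombine.field", "updated_at")  # hypothetical
       .mode("append")
       .save("hdfs:///hudi/orders"))
   ```
   
   This runs once per table, 4000+ times a day, inside the same never-restarted session.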
   
   
   **Expected behavior**
   
   I have no idea. Perhaps the Hudi table cache could be held by a standalone server or by the timeline server?
   
   **Environment Description**
   
   * Hudi version : 0.5.1
   * Spark version : 2.4.3
   * Hive version : 2.3.3
   * Hadoop version : 2.8.5
   * Storage (HDFS/S3/GCS..) : HDFS
   * Running on Docker? (yes/no) : no
   
   **Additional context**
   
   no
   
   **Stacktrace**
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-hudi] cdmikechen closed issue #1481: [SUPPORT] If using a Spark session to deal with many tables, the Hudi cache may report OOM

Posted by GitBox <gi...@apache.org>.
URL: https://github.com/apache/incubator-hudi/issues/1481
 
 
   
