Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/16 15:45:11 UTC

[GitHub] [hudi] worf0815 opened a new issue #5054: [SUPPORT] Hudi Failing with out of memory issue on Glue with >300 Mio. Records

worf0815 opened a new issue #5054:
URL: https://github.com/apache/hudi/issues/5054


   **Describe the problem you faced**
   
   We are trying to ingest and deduplicate, via Hudi, a table with a total of 25 billion records, each about 3-4 KB in size (there are even larger tables in our portfolio, with the largest ingesting 1-7 billion records daily at a total volume of 221 billion records).
   
   The above table ran into memory issues on AWS Glue 3.0 and failed in the "countByKey - Building Workload Profile" stage with "org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 26), which maintains the block data to fetch is dead." in the Spark UI logs.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start a Glue 3.0 job with 80 G1.X workers, reading from a standard Glue catalog table whose files are stored on S3 (see the sketch below).
   2. Without specifying a bounded execution limit of roughly 7 GB in Glue, the job fails with an out-of-memory error.
   3. I also tried it with Glue 2.0 and spill-to-S3 enabled, which resulted in nearly 3 TB of spilling...
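   
   For context, here is a minimal sketch of how such a Glue job wires these pieces together. The catalog names, the bounded-execution size, and the plain Spark datasource write are assumptions for illustration, not the exact job:
   
   ```
   # Minimal sketch, assuming the Glue 3.0 runtime with the Hudi libraries on
   # the classpath; catalog names and the bounded-execution size are illustrative.
   import sys
   from awsglue.context import GlueContext
   from awsglue.job import Job
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   
   args = getResolvedOptions(sys.argv, ["JOB_NAME"])
   glue_context = GlueContext(SparkContext.getOrCreate())
   spark = glue_context.spark_session
   job = Job(glue_context)
   job.init(args["JOB_NAME"], args)
   
   # Workload partitioning / bounded execution (the ~7 GB bound from step 2);
   # requires job bookmarks so each run only picks up a bounded slice of input.
   source = glue_context.create_dynamic_frame.from_catalog(
       database="source_db",        # assumed catalog database
       table_name="source_table",   # assumed catalog table backed by S3 parquet
       additional_options={"boundedSize": str(7 * 1024 ** 3)},
       transformation_ctx="source",
   )
   
   # Upsert into Hudi using the configuration dictionaries shown further below.
   (source.toDF()
       .write.format("hudi")
       .options(**commonConfig, **incrementalConfig)
       .mode("append")
       .save(commonConfig["path"]))
   
   job.commit()
   ```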
   
   **Expected behavior**
   
   If possible, a larger number of records should be processable during upsert with 80 G1.X workers.
   
   **Environment Description**
   
   * Hudi version : 0.9.0 (via AWS Glue Connector)
   
   * Spark version : 3.1.1 (AWS Glue)
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   The original complete dataset is about 420 GB of Snappy-compressed Parquet files.
   Running with or without the Hudi memory-fraction settings made no difference. Partition columns and record keys each consist of multiple columns:
   
   ```
   # Options shared by all writes: table identity, key generation, and Hive (Glue Data Catalog) sync
   commonConfig = {
       "className": "org.apache.hudi",
       "hoodie.table.name": hudi_table_name,
       "path": f"s3://upsert-poc/hudie/default/{hudi_table_name}",
       "hoodie.datasource.write.precombine.field": "update_date",
       "hoodie.datasource.write.partitionpath.field": partition_fields,
       "hoodie.datasource.write.recordkey.field": primary_keys,
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.support_timestamp": "true",
       "hoodie.datasource.hive_sync.use_jdbc": "false",
       "hoodie.datasource.hive_sync.database": hudi_database,
       "hoodie.datasource.hive_sync.table": hudi_table_name,
       "hoodie.datasource.hive_sync.partition_fields": partition_fields,
       "hoodie.datasource.hive_sync.mode":"hms",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
   }
   
   # Options specific to the incremental upsert runs: write operation and cleaner policy
   incrementalConfig = {
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
       "hoodie.cleaner.commits.retained": 1,
   }
   ```
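   
   The "memory fraction" settings referred to above are presumably Hudi's merge-memory options; as a point of reference, here is a minimal sketch of those and the related parallelism knob. The exact keys and values used in this job are not known, and the values below are purely illustrative, not a recommendation:
   
   ```
   # Sketch of the memory/parallelism knobs typically tried in this situation;
   # values are assumptions for illustration only.
   tuningConfig = {
       # Fraction of available memory the upsert merge may use before spilling.
       "hoodie.memory.merge.fraction": "0.75",
       # Hard cap (in bytes) for the spillable map used during merge (~1 GB here).
       "hoodie.memory.merge.max.size": str(1024 * 1024 * 1024),
       # Shuffle parallelism for the upsert write path.
       "hoodie.upsert.shuffle.parallelism": "1500",
   }
   
   # Passed alongside the other dictionaries on write, e.g.:
   # df.write.format("hudi").options(**commonConfig, **incrementalConfig, **tuningConfig)
   ```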
   
   **Stacktrace** 
   Spark UI log view (screenshot):
   
   ![grafik](https://user-images.githubusercontent.com/10959555/158626781-2e67516f-84b9-409f-838a-70bde86861e0.png)
   





[GitHub] [hudi] srinikandi commented on issue #5054: [SUPPORT] Hudi Failing with out of memory issue on Glue with >300 Mio. Records

Posted by GitBox <gi...@apache.org>.
srinikandi commented on issue #5054:
URL: https://github.com/apache/hudi/issues/5054#issuecomment-1080048329


   I have been experiencing a similar issue with Glue and Hudi 0.9.0. In my case, a full load of a table with close to a billion records, using a partitioning key on a data column, took around 16 minutes with 30 worker nodes. When I then ran an upsert on the same table with about 1.5 million records, the Glue job failed with "no more containers available", which points to a memory issue / disk spill.

