Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/20 19:29:10 UTC

[GitHub] [hudi] rubenssoto opened a new issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

rubenssoto opened a new issue #2003:
URL: https://github.com/apache/hudi/issues/2003


   Hi Guys,
   
   I'm trying to migrate my biggest dataset to Hudi and I'm facing some errors.
   
   Data size: 350 GB
   Spark master: 4 vCPUs, 16 GB RAM
   Core nodes: 8 r5.4xlarge = 16 vCPUs, 122 GB RAM each
   
   **My spark-submit:**
   
   `spark-submit --deploy-mode cluster --conf "spark.executor.extraJavaOptions=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -XX:+PrintTenuringDistribution -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/hoodie-heapdump.hprof" --conf spark.executor.cores=5 --conf spark.executor.memory=33g --conf spark.executor.memoryOverhead=2048 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4`
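
   For context, a rough sizing check under these settings (a sketch, assuming YARN packs one executor per 5 configured cores): each executor takes 5 cores and 33g + 2g overhead = ~35 GB, so one r5.4xlarge node (16 vCPUs, 122 GB) fits about 3 executors, giving the 8-node cluster roughly 24 executors, 120 cores, and ~840 GB of executor memory in total.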
   
   
   My Hudi options:
   
   {
      "hoodie.datasource.write.recordkey.field":"id",
      "hoodie.table.name":"stockout",
      "hoodie.datasource.write.table.name":"stockout",
      "hoodie.datasource.write.operation":"bulk_insert",
      "hoodie.datasource.write.partitionpath.field":"created_date_brt",
      "hoodie.datasource.write.hive_style_partitioning":"true",
      "hoodie.combine.before.insert":"true",
      "hoodie.combine.before.upsert":"false",
      "hoodie.datasource.write.precombine.field":"LineCreatedTimestamp",
      "hoodie.datasource.write.keygenerator.class":"org.apache.hudi.keygen.SimpleKeyGenerator",
      "hoodie.parquet.small.file.limit":996147200,
      "hoodie.parquet.max.file.size":1073741824,
      "hoodie.parquet.block.size":1073741824,
      "hoodie.copyonwrite.record.size.estimate":512,
      "hoodie.cleaner.commits.retained":10,
      "hoodie.datasource.hive_sync.enable":"true",
      "hoodie.datasource.hive_sync.database":"datalake_raw",
      "hoodie.datasource.hive_sync.table":"stockout",
      "hoodie.datasource.hive_sync.partition_fields":"created_date_brt",
      "hoodie.datasource.hive_sync.partition_extractor_class":"org.apache.hudi.hive.MultiPartKeysValueExtractor",
      "hoodie.datasource.hive_sync.jdbcurl":"jdbc:hive2://ip-10-0-21-127.us-west-2.compute.internal:10000",
      "hoodie.insert.shuffle.parallelism":1500,
      "hoodie.bulkinsert.shuffle.parallelism":700,
      "hoodie.upsert.shuffle.parallelism":1500
   }
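
   For reference, a minimal sketch of how options like these can be passed to the Hudi writer from PySpark (the input and output paths below are hypothetical placeholders, not the real job):

   ```python
   from pyspark.sql import SparkSession

   spark = SparkSession.builder.appName("stockout-bulk-insert").getOrCreate()

   # Hypothetical source path; the real input location is not shown in this issue.
   df = spark.read.parquet("s3://source-bucket/stockout/")

   # Same options as the JSON above, passed as a plain dict of strings.
   hudi_options = {
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.table.name": "stockout",
       "hoodie.datasource.write.operation": "bulk_insert",
       "hoodie.datasource.write.partitionpath.field": "created_date_brt",
       "hoodie.datasource.write.precombine.field": "LineCreatedTimestamp",
       "hoodie.bulkinsert.shuffle.parallelism": "700",
       # ... remaining options from the JSON above ...
   }

   (df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")  # Hudi's usual save mode; the operation option drives insert vs. upsert
       .save("s3://datalake-raw/stockout/"))  # hypothetical target path
   ```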
   
   <img width="1680" alt="Captura de Tela 2020-08-20 às 16 15 10" src="https://user-images.githubusercontent.com/36298331/90816019-f8640b80-e301-11ea-8334-c64bd3e0278c.png">
   <img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 38" src="https://user-images.githubusercontent.com/36298331/90816029-fc902900-e301-11ea-9515-6f407d05968e.png">
   <img width="1680" alt="Captura de Tela 2020-08-20 às 16 14 10" src="https://user-images.githubusercontent.com/36298331/90816031-fdc15600-e301-11ea-9830-47c2c91ee983.png">
   <img width="1680" alt="Captura de Tela 2020-08-20 às 16 13 46" src="https://user-images.githubusercontent.com/36298331/90816034-fe59ec80-e301-11ea-96e8-0b22de34e233.png">
   
   
   I tried using a bulk_insert parallelism of 4000, but it didn't work. I really don't know what to do...
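
   (As I understand it, hoodie.bulkinsert.shuffle.parallelism controls the number of write tasks, so 700 tasks over ~350 GB works out to roughly 500 MB of input per task, while 4000 tasks would be under 100 MB each.)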
   
   Thank you.





[GitHub] [hudi] rubenssoto closed issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

Posted by GitBox <gi...@apache.org>.
rubenssoto closed issue #2003:
URL: https://github.com/apache/hudi/issues/2003


   





[GitHub] [hudi] rubenssoto commented on issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #2003:
URL: https://github.com/apache/hudi/issues/2003#issuecomment-677960641


   I needed to use 10 r5.4xlarge nodes and the process worked.
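
   (That is roughly 160 vCPUs and 1.2 TB of RAM across the cluster, up from 128 vCPUs and ~976 GB with the original 8 core nodes.)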





[GitHub] [hudi] rubenssoto edited a comment on issue #2003: [SUPPORT] Spark Fails to Process 300Gb Of Data

Posted by GitBox <gi...@apache.org>.
rubenssoto edited a comment on issue #2003:
URL: https://github.com/apache/hudi/issues/2003#issuecomment-677858275


   Spark is stuck on the last screen; I don't know if it is doing anything.
   
   <img width="1680" alt="Captura de Tela 2020-08-20 às 17 15 33" src="https://user-images.githubusercontent.com/36298331/90821263-010c1000-e309-11ea-8b59-895e8da9f193.png">
   




