
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/21 22:21:35 UTC

[GitHub] [hudi] ZeMirella opened a new issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

ZeMirella opened a new issue #3699:
URL: https://github.com/apache/hudi/issues/3699


   **Describe the problem you faced**
   My job hangs during the toRdd task. Here are some screenshots of the task size:
   
   <img width="1771" alt="Captura de Tela 2021-09-21 às 19 07 20" src="https://user-images.githubusercontent.com/75490501/134254359-2f2d0cb9-8bb8-48d2-b02d-f84fd9dab9d6.png">
   
   <img width="1746" alt="Captura de Tela 2021-09-21 às 19 07 46" src="https://user-images.githubusercontent.com/75490501/134254377-963509f6-3075-4509-bf88-7188dc71d548.png">
   
   **My spark-submit**
   `spark-submit --deploy-mode cluster --conf spark.executor.cores=5 --conf spark.executor.memoryOverhead=6g --conf spark.executor.memory=43g --conf spark.dynamicAllocation.maxExecutors=50 --conf spark.sql.hive.convertMetastoreParquet=false --conf spark.rdd.compress=true --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryoserializer.buffer.max=512m --packages org.apache.hudi:hudi-spark-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1,com.audienceproject:spark-dynamodb_2.12:1.1.2 --py-files s3://bucket/modules.zip --files s3://bucket/config.yml s3://bucket/main.py`
   
   **My cluster configuration**
   
   <img width="1533" alt="Captura de Tela 2021-09-21 às 19 11 44" src="https://user-images.githubusercontent.com/75490501/134254642-ffdbc239-77b2-4d47-be1e-8a3872280533.png">
   
   I also set shuffle.parallelism=2000
   
   **Expected behavior**
   The job should run without hanging.
   
   **Environment Description**
   
   * Hudi version : 0.8
   
   * Spark version : 3.0
   
   * Hive version : 3.1.2
   
   * Hadoop version : Amazon 3.2.1
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] nsivabalan commented on issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3699:
URL: https://github.com/apache/hudi/issues/3699#issuecomment-997299238


   @ZeMirella: Can we get any updates on this? Did the proposed suggestions work for you? Please let us know.



[GitHub] [hudi] ZeMirella commented on issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

Posted by GitBox <gi...@apache.org>.

ZeMirella commented on issue #3699:
URL: https://github.com/apache/hudi/issues/3699#issuecomment-925219008


   Hi, thanks for your reply.
   **Which line of code from HoodieSparkUtils was run here?**
   The job hangs before it even starts: it hangs when it begins listing files and tries to read the S3 files.
   The hung task that the Spark history shows me is this one:
   <img width="958" alt="Captura de Tela 2021-09-22 às 15 50 25" src="https://user-images.githubusercontent.com/75490501/134405074-b8cde70b-d81d-4299-b4a6-05cceb538386.png">
   
   **What Hudi actions are you trying to perform?**
   This job was supposed to join some tables and save the output to S3. The line of code where it hangs is a create-table operation; here is the code:
   `        hudi_options = {
               'hoodie.table.name': self.table_name,
               'hoodie.datasource.write.recordkey.field': self.primary_key,
               'hoodie.datasource.write.table.name': self.table_name,
               'hoodie.datasource.write.operation': 'bulk_insert',
               'hoodie.bulkinsert.shuffle.parallelism': self.bulk_insert_shuffle_parallelism,
               'hoodie.datasource.hive_sync.enable': self.hive_sync_enabled,
               'hoodie.datasource.hive_sync.database': self.hive_database_name,
               'hoodie.datasource.hive_sync.jdbcurl': f'jdbc:hive2://{self.hive_jdbc_url}:10000',
               'hoodie.datasource.hive_sync.table': self.table_name,
               'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.NonPartitionedExtractor',
               'hoodie.datasource.hive_sync.support_timestamp': 'true',
               'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
               'hoodie.datasource.write.row.writer.enable': 'false',
               'hoodie.parquet.small.file.limit': 536870912,
               'hoodie.parquet.max.file.size': 1073741824,
               'hoodie.parquet.block.size': 536870912
           }
   
   spark_df.write.format("hudi").options(**hudi_options).mode("overwrite").save(self.table_path)`
    
   **What is the total input data size you are reading?**
   1.6 TB
   
   **How many executors were actually created during the run?**
   37
   <img width="1745" alt="image" src="https://user-images.githubusercontent.com/75490501/134403621-c4ca12e1-93fa-405a-910a-595013062343.png">
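
For a rough sense of scale, the figures reported in this thread (about 1.6 TB of input, 37 executors, `spark.executor.cores=5`) can be turned into a per-task estimate. This is only a back-of-the-envelope sketch: it assumes decimal terabytes and evenly sized tasks, which real data rarely gives you, and the function name `estimate` is made up for illustration.

```python
import math

# Figures reported in this thread; treating 1.6 TB as decimal terabytes is an assumption.
INPUT_BYTES = 1.6e12
EXECUTORS = 37          # executors actually created during the run
CORES_PER_EXECUTOR = 5  # from spark.executor.cores in the spark-submit command


def estimate(parallelism: int) -> dict:
    """Return a rough per-task input size and the number of scheduling waves."""
    slots = EXECUTORS * CORES_PER_EXECUTOR  # tasks that can run concurrently
    return {
        "slots": slots,
        "gb_per_task": INPUT_BYTES / parallelism / 1e9,
        "waves": math.ceil(parallelism / slots),
    }


for parallelism in (2000, 200):
    print(parallelism, estimate(parallelism))
```

Under these assumptions, a parallelism of 2000 gives roughly 0.8 GB per task, while 200 gives roughly 8 GB per task, so the parallelism value directly trades task count against task size.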
   



[GitHub] [hudi] xushiyan commented on issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #3699:
URL: https://github.com/apache/hudi/issues/3699#issuecomment-927230026


   OK @ZeMirella, so basically this is a bulk insert of a large chunk of data. Was `bulk_insert_shuffle_parallelism` set to 2000 in this case? Given the data size, you may want to try partitioning the output dataset. Say you choose a field `foo` from the schema as the partition field, and it gives roughly 200 partitions. A few configs you may try changing accordingly:
   - hoodie.bulkinsert.shuffle.parallelism = 200
   - hoodie.datasource.write.row.writer.enable = true // why was it disabled explicitly?
   - Use a proper class for hoodie.datasource.write.keygenerator.class and hoodie.datasource.hive_sync.partition_extractor_class. It should fit the partitioning strategy you chose. Check out [this blog](https://hudi.incubator.apache.org/blog/2021/02/13/hudi-key-generators/).
   - The code for the Spark write should be something like
     ```python
     spark_df.repartition(200, "foo").write.format("hudi").options(**hudi_options).mode("overwrite").partitionBy("foo").save(self.table_path)
     ```
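
The config changes listed above could be collected into the options dict along these lines. This is a minimal sketch, not a confirmed fix for this issue: the helper name `partitioned_hudi_options` is hypothetical, `foo` is the placeholder field from the comment, and the choice of `SimpleKeyGenerator` with `MultiPartKeysValueExtractor` is an assumption about one partitioning strategy (a single plain partition field). Other strategies need other classes.

```python
def partitioned_hudi_options(base: dict, partition_field: str, parallelism: int) -> dict:
    """Return a copy of `base` adjusted for a table partitioned on one field.

    Hypothetical helper: the key generator and partition extractor below are
    an assumed pairing for a single, non-timestamp partition field.
    """
    opts = dict(base)  # do not mutate the caller's dict
    opts.update({
        "hoodie.bulkinsert.shuffle.parallelism": parallelism,
        "hoodie.datasource.write.row.writer.enable": "true",
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.keygen.SimpleKeyGenerator",
        "hoodie.datasource.hive_sync.partition_fields": partition_field,
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    })
    return opts


# Example with a trimmed-down version of the non-partitioned options from this thread.
base_options = {
    "hoodie.table.name": "my_table",  # placeholder table name
    "hoodie.datasource.write.operation": "bulk_insert",
    "hoodie.datasource.write.keygenerator.class":
        "org.apache.hudi.keygen.NonpartitionedKeyGenerator",
    "hoodie.datasource.write.row.writer.enable": "false",
}
hudi_options = partitioned_hudi_options(base_options, "foo", 200)
```

The resulting `hudi_options` would then be passed to `.options(**hudi_options)` in the write call.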



[GitHub] [hudi] xushiyan commented on issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

Posted by GitBox <gi...@apache.org>.

xushiyan commented on issue #3699:
URL: https://github.com/apache/hudi/issues/3699#issuecomment-924585220


   Need more details to understand what's going on:
   - Which line of code from HoodieSparkUtils was run here?
   - What Hudi actions are you trying to perform?
   - What is the total input data size you are reading?
   - How many executors were actually created during the run?



[GitHub] [hudi] nsivabalan commented on issue #3699: [SUPPORT] Job hanging on toRdd at HoodieSparkUtils

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3699:
URL: https://github.com/apache/hudi/issues/3699#issuecomment-968190567


   Also, I see you are setting the max Parquet file size to 1 GB. Is that intentional?
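
For reference, the byte values used for the Parquet settings earlier in this thread convert as follows. This is just a unit conversion of the numbers already posted; the helper name `to_mib` is made up for illustration.

```python
# Parquet-related values copied from the hudi_options posted in this thread (bytes).
parquet_settings = {
    "hoodie.parquet.small.file.limit": 536870912,
    "hoodie.parquet.max.file.size": 1073741824,
    "hoodie.parquet.block.size": 536870912,
}


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


for key, value in parquet_settings.items():
    print(f"{key} = {to_mib(value):.0f} MiB")
```

So the max file size is 1024 MiB (1 GiB), and both the small file limit and the block size are 512 MiB.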

