Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/12 19:04:24 UTC

[GitHub] [hudi] rubenssoto opened a new issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

rubenssoto opened a new issue #1957:
URL: https://github.com/apache/hudi/issues/1957


   Hi Guys,
   
   Sometimes my Hudi upserts take a very long time. The job for this table used to run in less than 2 minutes, and I see the same behavior on all tables. I think it is a delete operation that removes old commits, but why does it take so long? The files are only 20 MB.
   
   <img width="1680" alt="Captura de Tela 2020-08-12 às 15 58 08" src="https://user-images.githubusercontent.com/36298331/90056538-6c286780-dcb5-11ea-88b3-3382b25721a4.png">
   <img width="1680" alt="Captura de Tela 2020-08-12 às 15 57 38" src="https://user-images.githubusercontent.com/36298331/90056549-6e8ac180-dcb5-11ea-9694-95b185fc2362.png">
   <img width="1671" alt="Captura de Tela 2020-08-12 às 15 57 21" src="https://user-images.githubusercontent.com/36298331/90056550-6fbbee80-dcb5-11ea-9d05-6e4c6b2e096e.png">
   
   
   
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.combine.before.upsert': 'true',
       'hoodie.datasource.write.precombine.field': 'LineCreatedTimestamp',
       'hoodie.parquet.small.file.limit': 200000000,
       'hoodie.parquet.max.file.size': 256000000,
       'hoodie.parquet.block.size': 256000000,
       'hoodie.cleaner.commits.retained': 10,
       'hoodie.datasource.hive_sync.enable': 'true',
       'hoodie.datasource.hive_sync.table': table_name,
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator',
       'hoodie.datasource.hive_sync.database': 'datalake_raw',
       'hoodie.datasource.hive_sync.jdbcurl': 'jdbc:hive2://ip-10-0-94-214.us-west-2.compute.internal:10000',
       'hoodie.copyonwrite.record.size.estimate': 512,
       'hoodie.insert.shuffle.parallelism': 10,
       'hoodie.upsert.shuffle.parallelism': 10
   }


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-678238829


   @rubenssoto : Sorry for the delay in responding. I looked at your Hudi.zip folder. All I see is 1 commit, which wrote 3 records. Are you suggesting that was slow? The isEmpty() stage corresponds to reading the source records from parquet and converting them to an Avro RDD; it does not include the index lookup or the Hudi write. Do you have any task-level metrics for this?



[GitHub] [hudi] n3nash closed issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #1957:
URL: https://github.com/apache/hudi/issues/1957


   



[GitHub] [hudi] nsivabalan edited a comment on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-767569608


   @rubenssoto: thanks. 
   @jiangok2006 : since the original post was from rubenssoto, would you mind creating a new support ticket with your issue? We can close this one out.



[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674193769


   Hi @bhasudha ,
   
   Yes, this is a simple job: it reads parquet generated by AWS DMS and writes to Hudi.
   
   My spark submit:
   spark-submit --deploy-mode cluster \
     --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=60s \
     --conf spark.dynamicAllocation.executorIdleTimeout=60s \
     --conf spark.dynamicAllocation.maxExecutors=1 \
     --conf spark.executor.memoryOverhead=2048 \
     --conf spark.executor.cores=3 \
     --conf spark.executor.memory=10g \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.hive.convertMetastoreParquet=false \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4 \
     --py-files python_modules \
     main.py poc_available_history
   
   
   
   <img width="1680" alt="Captura de Tela 2020-08-14 às 14 51 08" src="https://user-images.githubusercontent.com/36298331/90278335-afaddd80-de3d-11ea-8cf0-9fd168fa643a.png">
   <img width="1680" alt="Captura de Tela 2020-08-14 às 14 50 25" src="https://user-images.githubusercontent.com/36298331/90278349-b63c5500-de3d-11ea-91a9-02f3ae5bfc76.png">
   



[GitHub] [hudi] bvaradar commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-743030117


   @jiangok2006 : Can you enable INFO logging and paste the logs here? Also, can you attach Spark UI screenshots at the job, stage, and task levels?
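
For readers following along, one way to raise the log level at runtime is sketched below. It is written for PySpark and is not from the original thread; the function name `enable_info_logging` is hypothetical, and `spark` is assumed to be an existing `SparkSession`. `SparkContext.setLogLevel` is the standard Spark call for this.

```python
def enable_info_logging(spark):
    """Switch the driver's log level to INFO for an existing SparkSession.

    `spark` is assumed to be a pyspark.sql.SparkSession created elsewhere.
    """
    # setLogLevel overrides the log level configured for the driver at runtime.
    spark.sparkContext.setLogLevel("INFO")
```

With `--deploy-mode cluster`, the driver output typically ends up in the cluster manager's application logs rather than on the submitting console, so that is where to look for the INFO lines.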



[GitHub] [hudi] n3nash commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-824570762


   Closing this ticket due to inactivity. @jiangok2006 please re-open if the slowness persists.



[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-767542464


   My problem is an old one, so you can close this issue.



[GitHub] [hudi] bhasudha commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-673910635


   @rubenssoto : How are you writing the data? Could you paste the spark-submit command?



[GitHub] [hudi] nsivabalan commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-767483774


   @rubenssoto : Can you please respond to Balaji's comment when you can?
   @jiangok2006 : Can you please respond to Balaji's comment when you can?



[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674230980


   This is my write function:
   
   def write_hudi_dataset(spark_data_frame, write_folder_path, hudi_options, write_mode):
   
       spark_data_frame \
           .write \
           .options(**hudi_options) \
           .mode(write_mode) \
           .format('hudi') \
           .save(write_folder_path)
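
A minimal sketch, not part of the original thread, of how the wall-clock time around a write such as `write_hudi_dataset` could be recorded so that slow runs can be compared with fast ones. The names `timed` and `timings` are hypothetical. Because Spark evaluates lazily, only blocks that trigger an action (such as `.save(...)`) will show meaningful durations.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are still measured.
        results[label] = time.perf_counter() - start


# Hypothetical usage around the write function from this thread
# (left commented out because it needs a live Spark session):
#
# timings = {}
# with timed("hudi_write", timings):
#     write_hudi_dataset(spark_data_frame, write_folder_path, hudi_options, write_mode)
# print(timings)
```

This only gives a coarse per-call number. The per-stage and per-task breakdown requested elsewhere in the thread still has to come from the Spark UI.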



[GitHub] [hudi] rubenssoto edited a comment on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
rubenssoto edited a comment on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674194952


   **Commit Files:** [Hudi.zip](https://github.com/apache/hudi/files/5076229/Hudi.zip)
   
   
   I think it is the commit files.
   
   <img width="1426" alt="Captura de Tela 2020-08-14 às 15 00 25" src="https://user-images.githubusercontent.com/36298331/90279124-f3edad80-de3e-11ea-8c3c-76872a1738f9.png">
   



[GitHub] [hudi] jiangok2006 edited a comment on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

Posted by GitBox <gi...@apache.org>.
jiangok2006 edited a comment on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-742662159


   I observed slow upsert:
   
   ```
   val options = HudiOptionFactory.create(tableName = tableName,
         tableType = HudiTableType.COW,
         recordKeyField = "ulid",
         partitonPathField = "",
         preCombineField = "sid",
         insertDropDups = false,
         keyGeneratorType = HudiKeyGeneratorType.Nonpartitioned,
         partitionValueExtractorType = HudiPartitionValueExtractorType.NonPartitioned
       )
       val helper = HudiHelper(options, log)
   
       /**
        * insert perf
        */
       val listInfo = spark.createDataFrame(Seq(
         (Some(1), Some("11"), Some("111")),
         (Some(2), Some("11"), Some("444"))
       )).toDF("zid", "sid", "ulid")
       var t1 = ZonedDateTime.now
       helper.insert(listInfo, path, saveMode = SaveMode.Overwrite)
       var t2 = ZonedDateTime.now
       // Duration.between(t1, t2).getSeconds is about 5 seconds here
   
       /**
        * upsert perf
        */
       val listInfo2 = spark.createDataFrame(Seq(
         (Some(1), Some("22"), Some("111"))
       )).toDF("zid", "sid", "ulid")
   
       t1 = ZonedDateTime.now
       helper.upsert(
         df = listInfo2,
         path = path
       )
       t2 = ZonedDateTime.now
       // Duration.between(t1, t2).getSeconds is about 131 seconds here
   ```
   This undermines the benefit of using upsert to partially update a large dataset. Thanks for any help.

