Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/04/22 11:00:48 UTC

[GitHub] [incubator-hudi] harshi2506 opened a new issue #1552: [SUPPORT]

harshi2506 opened a new issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version :
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617976874


   hi @harshi2506, based on the above analysis, please try to increase the upsert parallelism (`hoodie.upsert.shuffle.parallelism`) and the number of Spark executor instances, for example:
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --num-executors 10 \
     --executor-cores 5 \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "200",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.partitionpath.field" -> "key", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp",
     "hoodie.memory.merge.max.size" -> "2004857600000"
   )
   
   val inputDF = spark.range(1, 300).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", current_timestamp()).
      withColumn("dt", date_format($"timestamp", "yyyy-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```





[GitHub] [incubator-hudi] vinothchandar commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

vinothchandar commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618024238


   @harshi2506 the pause is weird.. as you can see, the wall clock time itself is small, i.e. if you subtract the time lost pausing..
   
   Is this reproducible, i.e. does it happen every time? What version of Hudi are you running?





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617965381


   From the detailed commit metadata and the Spark UI above, we know:
   
   1. the first commit wrote about 700 million records.
   2. the upsert wrote 2633 records and touched 255 partitions.
   
   **Key point**: a small input is upserted into large old data file slices, and it touches many partitions.
   
   This is a different case, so the `hoodie.memory.merge.max.size` option will not help;
   I think most of the time is spent reading the old files. WDYT? @vinothchandar 
   
   ![image](https://user-images.githubusercontent.com/20113411/80020370-7f97a300-850b-11ea-9489-8e7002e98cda.png)
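   
   A quick back-of-envelope check of that key point, using only the numbers above (a sketch; the per-partition figure is just arithmetic, not something read from the commit metadata):
   ```
   // ~2633 upserted records over 255 partitions is roughly 10 records per
   // partition, yet COPY_ON_WRITE still reads and rewrites every touched
   // base file in full, so merge time tracks old-file size, not input size.
   val recordsPerPartition = 2633.0 / 255   // ≈ 10.3
   ```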
   





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting the parallelism to 200:
   hudiOptions += (HoodieWriteConfig.TABLE_NAME -> "table_name",
             DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
             DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
             DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
             DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_key", // date in yyyy/mm/dd format
             "hoodie.upsert.shuffle.parallelism" -> "200",
             DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(s"$s3Location")





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618657572


   Upsert hoodie log; the run cost about 30 min:
   ```
   Started at 20/04/22 20:12:14 
   
   
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:30 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@4af81941
   20/04/22 20:15:30 INFO HoodieCommitArchiveLog: No Instants to archive
   20/04/22 20:15:30 INFO HoodieWriteClient: Auto cleaning is enabled. Running cleaner now
   20/04/22 20:15:30 INFO HoodieWriteClient: Cleaner started
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:15:30 INFO FSUtils: Hadoop Configuration: fs.defaultFS:hdfs, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:15:31 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@72dd302
   20/04/22 20:15:31 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
   20/04/22 20:15:31 INFO FileSystemViewManager: Creating in-memory based Table View
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:15:31 INFO FSUtils: Hadoop Configuration: fs.defaultFS: hdfs, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:15:31 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21, 2001/11/27, 2001/11/28, 2001/11/29, 2001/11/30, 2001/12/03, 2001/12/07, 2001/12/10, 2001/12/11, 2001/12/12, 2001/12/13, 2001/12/17, 2001/12/20, 2001/12/21, 2001/12/25, 2001/12/26, 2001/12/27, 2001/12/28, 2001/12/29, 2001/12/31, 2002/01/02, 2002/01/03, 2002/01/07, 2002/01/08, 2002/01/09, 2002/01/11, 2002/01/13, 2002/01/14, 2002/01/15, 2002/01/16, 2002/01/17, 2002/01/18, 2002/01/21, 2002/01/22, 2002/01/23, 2002/01/25, 2002/01/28, 2002/01/29, 2002/01/30, 2002/02/03, 2002/02/05, 2002/02/06, 2002/02/07, 2002/02/11, 2002/02/12, 2002/02/14, 2002/02/15, 2002/02/18, 2002/02/19, 2002/02/20, 2002/02/21, 2002/02/22, 2002/02/26, 2002/03/02, 2002/03/04, 2002/03/06, 2002/03/10, 2002/03/15, 2002/03/17, 2002/03/19, 2002/03/20, 2002/03/21, 2002/03/22, 2002/03/25, 2002/03/26, 2002/03/27, 2002/03/28, 2002/03/30, 2002/04/02, 2002/04/03, 2002/04/04, 2002/04/05, 2002/04/07, 2002/04/09, 2002/04/10, 2002/04/11, 2002/04/14, 2002/04/16, 2002/04/17, 2002/04/22, 2002/04/23, 2002/04/25, 2002/04/30, 2002/05/01, 2002/05/02, 2002/05/06, 2002/05/08, 2002/05/09, 2002/05/12, 2002/05/13, 2002/05/14, 2002/05/17, 2002/05/19, 2002/05/20, 2002/05/21, 2002/05/22, 2002/05/24, 2002/05/27, 2002/05/28, 2002/05/29, 2002/06/03, 2002/06/04, 2002/06/06, 2002/06/10, 2002/06/11, 2002/06/12, 2002/06/14, 2002/06/18, 2002/06/19, 2002/06/24, 2002/07/01, 2002/07/02, 2002/07/04, 2002/07/17, 2002/07/21, 2002/07/31, 2002/08/01, 2002/08/02, 2002/08/07, 2002/08/21, 2002/08/26, 2002/08/27, 2002/08/28, 2002/08/29, 2002/09/03, 2002/09/05, 2002/09/06, 2002/09/09, 2002/09/10, 2002/09/11, 2002/09/12, 2002/09/13, 2002/09/16, 2002/09/17, 2002/09/18, 2002/09/19, 2002/09/23, 2002/09/24, 2002/09/25, 2002/09/28, 2002/09/30, 2002/10/07, 2002/10/08, 2002/10/09, 2002/10/10, 2002/10/11, 2002/10/14, 2002/10/15, 2002/10/16, 2002/10/17, 2002/10/21, 2002/10/23, 2002/10/25, 2002/10/28, 2002/10/29, 2002/10/30, 2002/10/31, 2002/11/07, 2002/11/11, 2002/11/12, 2002/11/13, 2002/11/14, 2002/11/15, 2002/11/18, 2002/11/19, 2002/11/20, 2002/11/21, 2002/11/22, 2002/11/25, 2002/11/26, 2002/11/27, 2002/12/02, 2002/12/04, 2002/12/05, 2002/12/06, 2002/12/09, 2002/12/10, 2002/12/11, 2002/12/12, 2002/12/13, 2002/12/16, 2002/12/17, 2002/12/18, 2002/12/19, 2002/12/20, 2002/12/23, 2002/12/24, 2002/12/25, 2002/12/30, 2002/12/31, 2003/01/03, 2003/01/04, 2003/01/08, 2003/01/09, 2003/01/10, 2003/01/13, 2003/01/14, 2003/01/15, 2003/01/16, 2003/01/17, 2003/01/20, 2003/01/21, 2003/01/22, 2003/01/23, 2003/01/24, 2003/01/27, 2003/01/28, 2003/01/29, 2003/01/30, 2003/01/31, 2003/02/03, 2003/02/06, 2003/02/07, 2003/02/10, 2003/02/11, 2003/02/12, 2003/02/13, 2003/02/14, 2003/02/17, 2003/02/18, 2003/02/19, 2003/02/24, 2003/02/25, 2003/02/26, 2003/02/27, 2003/02/28, 2003/03/04, 2003/03/05, 2003/03/06, 2003/03/07, 2003/03/10, 2003/03/11, 2003/03/12, 2003/03/13, 2003/03/14, 2003/03/17, 2003/03/18, 2003/03/19, 2003/03/24, 2003/03/25, 2003/03/26, 2003/03/27, 2003/03/31, 2003/04/01, 2003/04/02, 2003/04/05, 2003/04/07, 2003/04/08, 2003/04/09, 2003/04/10, 
2003/04/11, 2003/04/14, 2003/04/15, 2003/04/16, 2003/04/17, 2003/04/18, 2003/04/21, 2003/04/22, 2003/04/23, 2003/04/24, 2003/04/25, 2003/04/28, 2003/04/29, 2003/04/30, 2003/05/02, 2003/05/05, 2003/05/06, 2003/05/07, 2003/05/12, 2003/05/15, 2003/05/19, 2003/05/20, 2003/05/21, 2003/05/27, 2003/05/28, 2003/05/29, 2003/05/30, 2003/06/02, 2003/06/03, 2003/06/04, 2003/06/05, 2003/06/10, 2003/06/11, 2003/06/18, 2003/06/19, 2003/06/23, 2003/06/24, 2003/06/25, 2003/06/27, 2003/07/01, 2003/07/02, 2003/07/05, 2003/07/07, 2003/07/08, 2003/07/09, 2003/07/10, 2003/07/11, 2003/07/14, 2003/07/15, 2003/07/16, 2003/07/17, 2003/07/18, 2003/07/21, 2003/07/22, 2003/07/23, 2003/07/24, 2003/07/28, 2003/07/29, 2003/07/30, 2003/08/01, 2003/08/05, 2003/08/06, 2003/08/07, 2003/08/11, 2003/08/12, 2003/08/13, 2003/08/15, 2003/08/18, 2003/08/19, 2003/08/20, 2003/08/21, 2003/08/22, 2003/08/26, 2003/08/27, 2003/08/28, 2003/08/29, 2003/09/02, 2003/09/03, 2003/09/04, 2003/09/08, 2003/09/09, 2003/09/10, 2003/09/12, 2003/09/15, 2003/09/16, 2003/09/17, 2003/09/19, 2003/09/22, 2003/09/23, 2003/09/24, 2003/09/25, 2003/09/26, 2003/09/29, 2003/09/30, 2003/10/01, 2003/10/02, 2003/10/08, 2003/10/13, 2003/10/14, 2003/10/15, 2003/10/16, 2003/10/17, 2003/10/19, 2003/10/20, 2003/10/21, 2003/10/22, 2003/10/23, 2003/10/29, 2003/10/30, 2003/10/31, 2003/11/03, 2003/11/04, 2003/11/05, 2003/11/06, 2003/11/07, 2003/11/10, 2003/11/11, 2003/11/12, 2003/11/17, 2003/11/18, 2003/11/19, 2003/11/20, 2003/11/24, 2003/11/25, 2003/11/26, 2003/12/01, 2003/12/02, 2003/12/03, 2003/12/04, 2003/12/05, 2003/12/08, 2003/12/09, 2003/12/10, 2003/12/11, 2003/12/14, 2003/12/15, 2003/12/17, 2003/12/18, 2003/12/19, 2003/12/20, 2003/12/21, 2003/12/22, 2003/12/23, 2003/12/24, 2003/12/25, 2003/12/26, 2003/12/28, 2003/12/30, 2003/12/31, 2004/01/02, 2004/01/03, 2004/01/07, 2004/01/09, 2004/01/12, 2004/01/14, 2004/01/16, 2004/01/19, 2004/01/20, 2004/01/21, 2004/01/23, 2004/01/26, 2004/01/27, 2004/01/28, 2004/01/29, 2004/02/01, 2004/02/02, 2004/02/03, 2004/02/04, 2004/02/06, 2004/02/09, 2004/02/10, 2004/02/11, 2004/02/12, 2004/02/16, 2004/02/18, 2004/02/20, 2004/02/23, 2004/02/24, 2004/02/25, 2004/02/26, 2004/02/27, 2004/03/01, 2004/03/02, 2004/03/03, 2004/03/04, 2004/03/05, 2004/03/08, 2004/03/09, 2004/03/10, 2004/03/11, 2004/03/14, 2004/03/16, 2004/03/17, 2004/03/22, 2004/03/23, 2004/03/24, 2004/03/25, 2004/03/26, 2004/03/28, 2004/03/29, 2004/03/30, 2004/03/31, 2004/04/01, 2004/04/02, 2004/04/03, 2004/04/05, 2004/04/09, 2004/04/12, 2004/04/13, 2004/04/14, 2004/04/16, 2004/04/19, 2004/04/20, 2004/04/21, 2004/04/22, 2004/04/23, 2004/04/26, 2004/04/27, 2004/04/28, 2004/04/29, 2004/04/30, 2004/05/03, 2004/05/04, 2004/05/05, 2004/05/07, 2004/05/11, 2004/05/12, 2004/05/13, 2004/05/14, 2004/05/17, 2004/05/18, 2004/05/20, 2004/05/24, 2004/05/25, 2004/05/26, 2004/05/27, 2004/05/28, 2004/05/29, 2004/05/30, 2004/05/31, 2004/06/02, 2004/06/03, 2004/06/07, 2004/06/08, 2004/06/09, 2004/06/10, 2004/06/11, 2004/06/13, 2004/06/14, 2004/06/15, 2004/06/16, 2004/06/20, 2004/06/21, 2004/06/22, 2004/06/23, 2004/06/24, 2004/06/25, 2004/06/27, 2004/06/28, 2004/06/29, 2004/06/30, 2004/07/06, 2004/07/07, 2004/07/08, 2004/07/09, 2004/07/11, 2004/07/12, 2004/07/14, 2004/07/15, 2004/07/17, 2004/07/21, 2004/07/28, 2004/07/31, 2004/08/02, 2004/08/03, 2004/08/04, 2004/08/05, 2004/08/09, 2004/08/10, 2004/08/12, 2004/08/13, 2004/08/17, 2004/08/19, 2004/08/20, 2004/08/21, 2004/08/22, 2004/08/23, 2004/08/24, 2004/08/25, 2004/08/26, 2004/08/27, 2004/08/28, 2004/08/29, 2004/08/30, 2004/08/31, 2004/09/01, 
2004/09/02, 2004/09/09, 2004/09/10, 2004/09/11, 2004/09/12, 2004/09/13, 2004/09/14, 2004/09/15, 2004/09/16, 2004/09/17, 2004/09/18, 2004/09/19, 2004/09/20, 2004/09/21, 2004/09/22, 2004/09/23, 2004/09/24, 2004/09/27, 2004/09/28, 2004/09/29, 2004/09/30, 2004/10/01, 2004/10/02, 2004/10/03, 2004/10/04, 2004/10/05, 2004/10/06, 2004/10/07, 2004/10/08, 2004/10/09, 2004/10/10, 2004/10/11, 2004/10/12, 2004/10/13, 2004/10/14, 2004/10/15, 2004/10/16, 2004/10/17, 2004/10/18, 2004/10/19, 2004/10/20, 2004/10/21, 2004/10/22, 2004/10/23, 2004/10/24, 2004/10/25, 2004/10/26, 2004/10/27, 2004/10/28, 2004/10/29, 2004/10/30, 2004/10/31, 2004/11/02, 2004/11/03, 2004/11/04, 2004/11/05, 2004/11/06, 2004/11/07, 2004/11/08, 2004/11/09, 2004/11/10, 2004/11/11, 2004/11/12, 2004/11/13, 2004/11/16, 2004/11/17, 2004/11/18, 2004/11/19, 2004/11/20, 2004/11/21, 2004/11/22, 2004/11/23,......] - 6040 partitions
   
   
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Using cleanerParallelism: 200
   20/04/22 20:24:33 INFO SparkContext: Starting job: collect at HoodieCopyOnWriteTable.java:396
   20/04/22 20:24:33 INFO DAGScheduler: Registering RDD 58 (repartition at HoodieCopyOnWriteTable.java:392)
   20/04/22 20:24:33 INFO DAGScheduler: Registering RDD 62 (mapPartitionsToPair at HoodieCopyOnWriteTable.java:393)
   20/04/22 20:24:33 INFO DAGScheduler: Got job 8 (collect at HoodieCopyOnWriteTable.java:396) with 200 output partitions
   20/04/22 20:24:33 INFO DAGScheduler: Final stage: ResultStage 29 (collect at HoodieCopyOnWriteTable.java:396)
   20/04/22 20:24:33 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 28)
   20/04/22 20:24:33 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 28)
   20/04/22 20:24:33 INFO DAGScheduler: Submitting ShuffleMapStage 27 (MapPartitionsRDD[58] at repartition at HoodieCopyOnWriteTable.java:392), which has no missing parents
   20/04/22 20:24:33 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 174.2 KB, free 1008.1 MB)
   20/04/22 20:24:33 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 60.8 KB, free 1008.0 MB)
   20/04/22 20:24:33 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on ip (size: 60.8 KB, free: 1008.7 MB)
   20/04/22 20:24:33 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1201
   20/04/22 20:24:33 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 27 (MapPartitionsRDD[58] at repartition at HoodieCopyOnWriteTable.java:392) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   
   
   20/04/22 20:24:35 INFO DAGScheduler: ShuffleMapStage 27 (repartition at HoodieCopyOnWriteTable.java:392) finished in 1.401 s
   20/04/22 20:24:35 INFO DAGScheduler: looking for newly runnable stages
   20/04/22 20:24:35 INFO DAGScheduler: running: Set()
   20/04/22 20:24:35 INFO DAGScheduler: waiting: Set(ShuffleMapStage 28, ResultStage 29)
   20/04/22 20:24:35 INFO DAGScheduler: failed: Set()
   20/04/22 20:24:35 INFO DAGScheduler: Submitting ShuffleMapStage 28 (MapPartitionsRDD[62] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393), which has no missing parents
   20/04/22 20:24:35 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 174.8 KB, free 1007.9 MB)
   20/04/22 20:24:35 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 61.5 KB, free 1007.8 MB)
   20/04/22 20:24:35 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on ip (size: 61.5 KB, free: 1008.7 MB)
   20/04/22 20:24:35 INFO SparkContext: Created broadcast 18 from broadcast at DAGScheduler.scala:1201
   20/04/22 20:24:35 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 28 (MapPartitionsRDD[62] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   
   
   
   20/04/22 20:24:35 INFO DAGScheduler: ResultStage 29 (collect at HoodieCopyOnWriteTable.java:396) finished in 0.057 s
   20/04/22 20:24:35 INFO DAGScheduler: Job 8 finished: collect at HoodieCopyOnWriteTable.java:396, took 1.627846 s
   20/04/22 20:24:35 INFO FileSystemViewManager: Creating InMemory based view for basePath <storageLocation>
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:35 INFO FSUtils: Hadoop Configuration: fs.defaultFS: hdfs-ip, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:24:35 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:24:35 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:24:35 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:24:35 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@1b7440c3
   20/04/22 20:24:35 INFO HoodieWriteClient: Cleaned 0 files
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Marking instant complete [==>20200422201250__clean__INFLIGHT]
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 0 bytes
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200422201250.clean.inflight. Elapsed seconds: 0.
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Created a new file in meta path: <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 272855 bytes
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200422201250.clean.inflight. Elapsed seconds: 0.
   20/04/22 20:24:36 INFO S3NativeFileSystem: rename <storageLocation>.hoodie/20200422201250.clean.inflight <storageLocation>.hoodie/20200422201250.clean
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Completed [==>20200422201250__clean__INFLIGHT]
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 successful!
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> default, hoodie.insert.shuffle.parallelism -> 200, path -> <storageLocation>, hoodie.datasource.write.precombine.field -> dedupeKey, hoodie.datasource.hive_sync.partition_fields -> , hoodie.datasource.write.payload.class -> org.apache.hudi.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> unknown, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> false, hoodie.datasource.write.recordkey.field -> customer_trx_id, hoodie.datasource.view.type -> read_optimized, hoodie.table.name -> ra_customer_trx_all, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://localhost:10000, hoodie.datasource.write.storage.type -> COPY_ON_WRITE, hoodie.datasource.hive_sync.username -> hive, hoodie.datasource.write.streaming.retry.count -> 3, hoodie.datasource.hive_sync.password -> hive, hoodie.datasource.write.keygenerator.class -> org.apache.hudi.SimpleKeyGenerator, hoodie.upsert.shuffle.parallelism -> 200, hoodie.datasource.hive_sync.assume_date_partitioning -> false, hoodie.datasource.write.partitionpath.field -> partition_key, hoodie.datasource.write.commitmeta.key.prefix -> _, hoodie.bulkinsert.shuffle.parallelism -> 200)
   
   
   
   Finally, it logs loading the hoodie table for each partition, for all 6040 partitions.
   Something similar to:
   
   
   20/04/22 21:06:20 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>/.hoodie/hoodie.properties
   20/04/22 21:06:20 INFO S3NativeFileSystem: Opening '<storageLocation>/.hoodie/hoodie.properties' for reading
   20/04/22 21:06:20 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 21:06:20 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 21:06:20 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@b9cfa18
   20/04/22 21:06:20 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/04/20, #FileGroups=1
   20/04/22 21:06:20 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=3, FileGroupsCreationTime=160, StoreTimeTaken=0
   20/04/22 21:06:20 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: <storageLocation>, caching 1 files under <storageLocation>/2020/04/20
   20/04/22 21:06:25 INFO SparkContext: Invoking stop() from shutdown hook
   20/04/22 21:06:25 INFO SparkUI: Stopped Spark web UI at ip
   20/04/22 21:06:25 INFO YarnClientSchedulerBackend: Interrupting monitor thread
   20/04/22 21:06:25 INFO YarnClientSchedulerBackend: Shutting down all executors
   20/04/22 21:06:25 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   20/04/22 21:06:25 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
   
   It runs till 20/04/22 21:06:25 
   ```
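   
   A possible way to sidestep that per-partition load on the read side (a sketch, assuming the same yyyy/mm/dd partition layout as above; not something verified in this thread): narrow the read glob to the partitions actually needed instead of globbing the whole table.
   ```
   // Globbing all 6040 partitions makes HoodieROTablePathFilter load table
   // metadata once per partition directory; a narrower glob touches far fewer.
   // One month instead of the whole table:
   spark.read.format("org.apache.hudi").load("<storageLocation>/2020/04/*/*").show()
   ```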





[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting the parallelism to 200:
   ```
   val hudiOptions = Map(HoodieWriteConfig.TABLE_NAME -> "table_name",
             DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
             DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
             DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
             DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_key", // date in yyyy/mm/dd format
             "hoodie.upsert.shuffle.parallelism" -> "200",
             DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(load_path)
   ```
   I will try setting the spark executor instances. Thanks





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619449248


   hi @harshi2506, build steps:
   **1. Build Env**
   - JDK8
   - Unix
   
   **2. Commands**
   ```
   git clone https://github.com/apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   ```
   
   **3. Run env**
   - Spark-2.4.4+
   - avro-1.8.0
   ```
   // run in local env
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --driver-memory 6G \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // run in yarn env
   export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --executor-memory 6G \
     --num-executors 5 \
     --executor-cores 5 \
     --queue root.default \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // scripts
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   // val basePath = "hdfs:///hudi/test"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.datasource.write.recordkey.field" -> "key",
     "hoodie.datasource.write.partitionpath.field" -> "dt", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 7).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", unix_timestamp()).
      withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
      withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Overwrite").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   ```





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620893610


   #### BUG status
   
   | bug | status | fix / workaround |
   | :--- | :--- | :--- |
   | first upsert takes a long time | fixed | upgrade version (0.5.0 to master) |
   | FileStatusExt: Metadata Entry doesn't exist | fixed | use a whole new base path |
   | java.io.FileNotFoundException: No such file or directory | fixed | root cause still unknown |
   | 17 minutes lag between HoodieActiveTimeline and CleanActionExecutor | still investigating | xxx |





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617976874


   hi @harshi2506, based on the above analysis, please try to increase the upsert parallelism (`hoodie.upsert.shuffle.parallelism`) and the number of Spark executor instances, for example:
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --num-executors 10 \
     --executor-cores 5 \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "200",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.partitionpath.field" -> "key", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp",
     "hoodie.memory.merge.max.size" -> "2004857600000"
   )
   
   val inputDF = spark.range(1, 300).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", current_timestamp()).
      withColumn("dt", date_format($"timestamp", "yyyy-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618068101


   @vinothchandar yes, it is happening every time; I tried it 3 times and it's always the same. I see a hoodie commit being added within at most 5-6 minutes, and then it takes almost an hour to complete all the steps. I am using AWS EMR, Hudi version: 0.5.0-incubating.





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619306738


   ## Spark log analysis brainstorming 
   
   hi @vinothchandar, here is a preliminary analysis of this issue
   (based on `release-0.5.0` branch https://github.com/apache/incubator-hudi/tree/release-0.5.0)
   
   ### Simplified origin spark log
   
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: "spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/22 20:12:14 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<JARPATH>' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:12:14 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
   20/04/22 20:12:14 INFO SparkContext: Running Spark version 2.4.4
   20/04/22 20:12:14 INFO SparkContext: Submitted application: HUDI incremental data loading
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing modify acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls groups to: 
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/.hoodie/hoodie.properties' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21,
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 successful!
   
   
   // CreateRelation(sqlContext, parameters, df.schema)
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data source with options ...
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:38 INFO S3NativeFileSystem: Opening '<storageLocation>/2001/05/10/.hoodie_partition_metadata' for reading
   20/04/22 20:24:38 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/10/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>/2001/05/17/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/17/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:39 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:39 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>/2001/05/18/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/18/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   // Finish part
   20/04/22 21:06:25 INFO SparkContext: Successfully stopped SparkContext
   20/04/22 21:06:25 INFO ShutdownHookManager: Shutdown hook called
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c24e90ad-7be5-4a04-b4f1-2575cf68bd5a
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c18b2f4f-e525-43de-b896-75244fe63591
   ```
   
   #### Analysis
   
   ![image](https://user-images.githubusercontent.com/20113411/80269021-c7225880-86de-11ea-84fe-464d08fc4e9b.png)
   
   
   ### Analysis about step 2
   1. Call stack trace
   ```
   "main" Id=1 cpuUsage=94% RUNNABLE
       ...
       at org.apache.hudi.common.util.FSUtils.processFiles(FSUtils.java:227)
       at org.apache.hudi.common.util.FSUtils.getAllFoldersWithPartitionMetaFile(FSUtils.java:191)
       at org.apache.hudi.common.util.FSUtils.getAllPartitionPaths(FSUtils.java)
       at org.apache.hudi.table.HoodieCopyOnWriteTable.clean(HoodieCopyOnWriteTable.java:285)
       at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:950)
   ```
   2. IMO, `Step 2` is affected by the NN's RPC performance.
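   
   A rough way to gauge that raw listing cost from the driver (a sketch, not from this thread; it walks the table with the plain Hadoop FileSystem API, so every directory costs one list RPC, just like `getAllPartitionPaths` has to pay):
   ```
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   val tablePath = new Path("<storageLocation>")
   val fs = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
   
   // Recursive listing: with ~6000 date partitions this issues thousands of
   // sequential list calls, which is where a driver-side pause would go.
   def walk(p: Path): Long =
     fs.listStatus(p).map(s => if (s.isDirectory) walk(s.getPath) else 1L).sum
   
   val start = System.currentTimeMillis()
   println(s"listed ${walk(tablePath)} files in ${System.currentTimeMillis() - start} ms")
   ```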
   
   ### Analysis about step 4
   When I set the `WARN` log level, the clean operation runs faster. Is the performance loss due to the logging interaction between the driver and the executors?
   
   ![image](https://user-images.githubusercontent.com/20113411/80269122-75c69900-86df-11ea-9a34-b805b694f9e9.png)
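   
   For reference, a minimal way to try that from the spark-shell (this only changes the log level, nothing else about the job):
   ```
   // Suppress the per-file INFO logging before triggering the write.
   sc.setLogLevel("WARN")
   ```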
   





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617895350


   Hi @harshi2506, please share the Spark stage UI, thanks





[GitHub] [incubator-hudi] vinothchandar commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

vinothchandar commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620781382


   @harshi2506 On master, there is no `CopyOnWriteLazyInsertIterable` 
   https://github.com/apache/incubator-hudi/tree/master/hudi-client/src/main/java/org/apache/hudi/execution . Wondering if you are behind on master. 
   
   In any case, are you running with consistency check enabled? That error seems to indicate that close() cannot find the file it wrote. 
   
   https://hudi.apache.org/docs/configurations.html#withConsistencyCheckEnabled 
   This has been done many times.. so it must be some issue somewhere..
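   
   For anyone following along, a minimal sketch of what enabling that check looks like from the datasource path (assuming the `hoodie.consistency.check.enabled` key that backs `withConsistencyCheckEnabled`, and reusing `hudiOptions`/`inputDF`/`basePath` from the earlier examples):
   ```
   // Re-checks file visibility after writes to guard against S3's eventual consistency.
   val optionsWithCheck = hudiOptions + ("hoodie.consistency.check.enabled" -> "true")
   
   inputDF.write.format("org.apache.hudi").
     options(optionsWithCheck).
     mode("Append").
     save(basePath)
   ```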
   
   
   





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619491671


   > hi @harshi2506, build steps:
   > **1. Build Env**
   > 
   > * JDK8
   > * Unix
   > 
   > **2. Commands**
   > 
   > ```
   > git clone https://github.com/apache/incubator-hudi.git
   > mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   > ```
   > 
   > **3. Run env**
   > 
   > * Spark-2.4.4+
   > * avro-1.8.0
   > 
   > ```
   > // run in local env
   > export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   > ${SPARK_HOME}/bin/spark-shell \
   >   --driver-memory 6G \
   >   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
   >   --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
   >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   > 
   > // run in yarn env
   > export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   > ${SPARK_HOME}/bin/spark-shell \
   >   --master yarn \
   >   --driver-memory 6G \
   >   --executor-memory 6G \
   >   --num-executors 5 \
   >   --executor-cores 5 \
   >   --queue root.default \
   >   --packages org.apache.spark:spark-avro_2.11:2.4.4 \
   >   --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
   >   --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   > 
   > // scripts
   > import org.apache.spark.sql.functions._
   > 
   > val tableName = "hudi_mor_table"
   > val basePath = "file:///tmp/hudi_mor_tablen"
   > // val basePath = "hdfs:///hudi/test"
   > 
   > val hudiOptions = Map[String,String](
   >   "hoodie.upsert.shuffle.parallelism" -> "10",
   >   "hoodie.datasource.write.recordkey.field" -> "key",
   >   "hoodie.datasource.write.partitionpath.field" -> "dt", 
   >   "hoodie.table.name" -> tableName,
   >   "hoodie.datasource.write.precombine.field" -> "timestamp"
   > )
   > 
   > val inputDF = spark.range(1, 7).
   >    withColumn("key", $"id").
   >    withColumn("data", lit("data")).
   >    withColumn("timestamp", unix_timestamp()).
   >    withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
   >    withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   > 
   > inputDF.write.format("org.apache.hudi").
   >   options(hudiOptions).
   >   mode("Overwrite").
   >   save(basePath)
   > 
   > spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   > ```
   
   @lamber-ken will try and let you know. Thanks





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620246831


   ## The latest spark log (master branch)
   
   ### Simplified origin spark log
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: "spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/26 17:35:58 WARN S3CryptoModuleAE: Unable to detect encryption information for object 'jar' in bucket '<storageBucket>'. Returning object without decryption.
   
   // Start rollback
   20/04/26 17:36:34 INFO HoodieWriteClient: Begin rollback of instant 20200426170940
   20/04/26 17:36:34 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/26 17:36:34 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-0-55-99.ec2.internal:8020]
   
   // Output repeatedly
   20/04/26 17:36:40 INFO FileStatusExt: Metadata Entry doesn't exist
   ...
   20/04/26 18:19:07 INFO FileStatusExt: Metadata Entry doesn't exist
   
   // Finish
   20/04/26 18:19:08 INFO CleanActionExecutor: Total Partitions to clean : 6052, with policy KEEP_LATEST_COMMITS
   20/04/26 18:19:08 INFO CleanActionExecutor: Using cleanerParallelism: 6000
   20/04/26 18:19:08 INFO SparkContext: Starting job: collect at CleanActionExecutor.java:87
   20/04/26 18:19:08 INFO DAGScheduler: Got job 10 (collect at CleanActionExecutor.java:87) with 6000 output partitions
   20/04/26 18:19:08 INFO DAGScheduler: Final stage: ResultStage 30 (collect at CleanActionExecutor.java:87)
   20/04/26 18:19:08 INFO DAGScheduler: Parents of final stage: List()
   20/04/26 18:19:08 INFO DAGScheduler: Missing parents: List()
   20/04/26 18:19:08 INFO DAGScheduler: Submitting ResultStage 30 (MapPartitionsRDD[63] at map at CleanActionExecutor.java:86), which has no missing parents
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 247.7 KB, free 1008.3 MB)
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 87.1 KB, free 1008.2 MB)
   20/04/26 18:19:08 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory on ip-10-0-55-99.ec2.internal:44415 (size: 87.1 KB, free: 1008.8 MB)
   20/04/26 18:19:08 INFO SparkContext: Created broadcast 20 from broadcast at DAGScheduler.scala:1201
   20/04/26 18:19:08 INFO DAGScheduler: Submitting 6000 missing tasks from ResultStage 30 (MapPartitionsRDD[63] at map at CleanActionExecutor.java:86) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   20/04/26 18:19:08 INFO YarnScheduler: Adding task set 30.0 with 6000 tasks
   20/04/26 18:19:23 INFO YarnScheduler: Removed TaskSet 32.0, whose tasks have all completed, from pool 
   20/04/26 18:19:23 INFO DAGScheduler: ResultStage 32 (collect at CleanActionExecutor.java:155) finished in 3.444 s
   20/04/26 18:19:23 INFO DAGScheduler: Job 11 finished: collect at CleanActionExecutor.java:155, took 8.313284 s
   20/04/26 18:19:23 INFO HoodieActiveTimeline: Loaded instants [[20200425201231__clean__COMPLETED], [20200425201231__commit__COMPLETED], [==>20200426173625__clean__INFLIGHT], [20200426173625__commit__COMPLETED]]
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Checking for file exists ?<storageLocation>.hoodie/20200426173625.clean.inflight
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 273450 bytes
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200426173625.clean. Elapsed seconds: 0.
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Create new file for toInstant ?<storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO CleanActionExecutor: Marked clean started on 20200426173625 as complete
   20/04/26 18:19:24 INFO AbstractHoodieWriteClient: Committed 20200426173625
   20/04/26 18:19:24 INFO HoodieSparkSqlWriter$: Commit 20200426173625 successful!
   20/04/26 18:19:28 INFO SparkContext: Invoking stop() from shutdown hook
   20/04/26 18:19:28 INFO SparkUI: Stopped Spark web UI at ip
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Interrupting monitor thread
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Shutting down all executors
   20/04/26 18:19:28 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   20/04/26 18:19:28 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
   (serviceOption=None,
    services=List(),
    started=false)
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Stopped
   20/04/26 18:19:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   20/04/26 18:19:28 INFO MemoryStore: MemoryStore cleared
   20/04/26 18:19:28 INFO BlockManager: BlockManager stopped
   20/04/26 18:19:28 INFO BlockManagerMaster: BlockManagerMaster stopped
   20/04/26 18:19:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   20/04/26 18:19:28 INFO SparkContext: Successfully stopped SparkContext
   20/04/26 18:19:28 INFO ShutdownHookManager: Shutdown hook called
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-e12c26c7-150a-4837-8fb2-72db585dd2f3
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-1a4b587a-a177-4cab-8f93-6849d64386d9
   Command exiting with ret '0'
   ```
   
   #### Analysis
   
   ![image](https://user-images.githubusercontent.com/20113411/80422808-6b8be100-8911-11ea-9dd1-bfa6b0bafd9c.png)
   
   ### Analysis about step 2
   
   `Rollback` cost 42 mins, outputting `FileStatusExt: Metadata Entry doesn't exist` repeatedly.
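   
   One way to isolate where the time goes (a sketch of an assumption, not something tried in this thread): turn off inline cleaning so the upsert commit and the clean can be timed separately, using the standard compaction config keys and reusing the `hudiOptions` map from the earlier examples.
   ```
   // "hoodie.clean.automatic" -> "false" skips the clean that runs right after
   // the commit (it can be triggered out of band instead); the cleaner's
   // parallelism is tunable independently of the upsert shuffle parallelism.
   val optionsNoAutoClean = hudiOptions ++ Map(
     "hoodie.clean.automatic"     -> "false",
     "hoodie.cleaner.parallelism" -> "200"
   )
   ```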





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618657572


   User report: upsert hoodie log; the run cost about 30 min:
   ```
   Started at 20/04/22 20:12:14 
   
   
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:30 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@4af81941
   20/04/22 20:15:30 INFO HoodieCommitArchiveLog: No Instants to archive
   20/04/22 20:15:30 INFO HoodieWriteClient: Auto cleaning is enabled. Running cleaner now
   20/04/22 20:15:30 INFO HoodieWriteClient: Cleaner started
   20/04/22 20:15:30 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:15:30 INFO FSUtils: Hadoop Configuration: fs.defaultFS:hdfs, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:15:31 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@72dd302
   20/04/22 20:15:31 INFO FileSystemViewManager: Creating View Manager with storage type :MEMORY
   20/04/22 20:15:31 INFO FileSystemViewManager: Creating in-memory based Table View
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:15:31 INFO FSUtils: Hadoop Configuration: fs.defaultFS: hdfs, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:15:31 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21, 2001/11/27, 2001/11/28, 2001/11/29, 2001/11/30, 2001/12/03, 2001/12/07, 2001/12/10, 2001/12/11, 2001/12/12, 2001/12/13, 2001/12/17, 2001/12/20, 2001/12/21, 2001/12/25, 2001/12/26, 2001/12/27, 2001/12/28, 2001/12/29, 2001/12/31, 2002/01/02, 2002/01/03, 2002/01/07, 2002/01/08, 2002/01/09, 2002/01/11, 2002/01/13, 2002/01/14, 2002/01/15, 2002/01/16, 2002/01/17, 2002/01/18, 2002/01/21, 2002/01/22, 2002/01/23, 2002/01/25, 2002/01/28, 2002/01/29, 2002/01/30, 2002/02/03, 2002/02/05, 2002/02/06, 2002/02/07, 2002/02/11, 2002/02/12, 2002/02/14, 2002/02/15, 2002/02/18, 2002/02/19, 2002/02/20, 2002/02/21, 2002/02/22, 2002/02/26, 2002/03/02, 2002/03/04, 2002/03/06, 2002/03/10, 2002/03/15, 2002/03/17, 2002/03/19, 2002/03/20, 2002/03/21, 2002/03/22, 2002/03/25, 2002/03/26, 2002/03/27, 2002/03/28, 2002/03/30, 2002/04/02, 2002/04/03, 2002/04/04, 2002/04/05, 2002/04/07, 2002/04/09, 2002/04/10, 2002/04/11, 2002/04/14, 2002/04/16, 2002/04/17, 2002/04/22, 2002/04/23, 2002/04/25, 2002/04/30, 2002/05/01, 2002/05/02, 2002/05/06, 2002/05/08, 2002/05/09, 2002/05/12, 2002/05/13, 2002/05/14, 2002/05/17, 2002/05/19, 2002/05/20, 2002/05/21, 2002/05/22, 2002/05/24, 2002/05/27, 2002/05/28, 2002/05/29, 2002/06/03, 2002/06/04, 2002/06/06, 2002/06/10, 2002/06/11, 2002/06/12, 2002/06/14, 2002/06/18, 2002/06/19, 2002/06/24, 2002/07/01, 2002/07/02, 2002/07/04, 2002/07/17, 2002/07/21, 2002/07/31, 2002/08/01, 2002/08/02, 2002/08/07, 2002/08/21, 2002/08/26, 2002/08/27, 2002/08/28, 2002/08/29, 2002/09/03, 2002/09/05, 2002/09/06, 2002/09/09, 2002/09/10, 2002/09/11, 2002/09/12, 2002/09/13, 2002/09/16, 2002/09/17, 2002/09/18, 2002/09/19, 2002/09/23, 2002/09/24, 2002/09/25, 2002/09/28, 2002/09/30, 2002/10/07, 2002/10/08, 2002/10/09, 2002/10/10, 2002/10/11, 2002/10/14, 2002/10/15, 2002/10/16, 2002/10/17, 2002/10/21, 2002/10/23, 2002/10/25, 2002/10/28, 2002/10/29, 2002/10/30, 2002/10/31, 2002/11/07, 2002/11/11, 2002/11/12, 2002/11/13, 2002/11/14, 2002/11/15, 2002/11/18, 2002/11/19, 2002/11/20, 2002/11/21, 2002/11/22, 2002/11/25, 2002/11/26, 2002/11/27, 2002/12/02, 2002/12/04, 2002/12/05, 2002/12/06, 2002/12/09, 2002/12/10, 2002/12/11, 2002/12/12, 2002/12/13, 2002/12/16, 2002/12/17, 2002/12/18, 2002/12/19, 2002/12/20, 2002/12/23, 2002/12/24, 2002/12/25, 2002/12/30, 2002/12/31, 2003/01/03, 2003/01/04, 2003/01/08, 2003/01/09, 2003/01/10, 2003/01/13, 2003/01/14, 2003/01/15, 2003/01/16, 2003/01/17, 2003/01/20, 2003/01/21, 2003/01/22, 2003/01/23, 2003/01/24, 2003/01/27, 2003/01/28, 2003/01/29, 2003/01/30, 2003/01/31, 2003/02/03, 2003/02/06, 2003/02/07, 2003/02/10, 2003/02/11, 2003/02/12, 2003/02/13, 2003/02/14, 2003/02/17, 2003/02/18, 2003/02/19, 2003/02/24, 2003/02/25, 2003/02/26, 2003/02/27, 2003/02/28, 2003/03/04, 2003/03/05, 2003/03/06, 2003/03/07, 2003/03/10, 2003/03/11, 2003/03/12, 2003/03/13, 2003/03/14, 2003/03/17, 2003/03/18, 2003/03/19, 2003/03/24, 2003/03/25, 2003/03/26, 2003/03/27, 2003/03/31, 2003/04/01, 2003/04/02, 2003/04/05, 2003/04/07, 2003/04/08, 2003/04/09, 2003/04/10, 
2003/04/11, 2003/04/14, 2003/04/15, 2003/04/16, 2003/04/17, 2003/04/18, 2003/04/21, 2003/04/22, 2003/04/23, 2003/04/24, 2003/04/25, 2003/04/28, 2003/04/29, 2003/04/30, 2003/05/02, 2003/05/05, 2003/05/06, 2003/05/07, 2003/05/12, 2003/05/15, 2003/05/19, 2003/05/20, 2003/05/21, 2003/05/27, 2003/05/28, 2003/05/29, 2003/05/30, 2003/06/02, 2003/06/03, 2003/06/04, 2003/06/05, 2003/06/10, 2003/06/11, 2003/06/18, 2003/06/19, 2003/06/23, 2003/06/24, 2003/06/25, 2003/06/27, 2003/07/01, 2003/07/02, 2003/07/05, 2003/07/07, 2003/07/08, 2003/07/09, 2003/07/10, 2003/07/11, 2003/07/14, 2003/07/15, 2003/07/16, 2003/07/17, 2003/07/18, 2003/07/21, 2003/07/22, 2003/07/23, 2003/07/24, 2003/07/28, 2003/07/29, 2003/07/30, 2003/08/01, 2003/08/05, 2003/08/06, 2003/08/07, 2003/08/11, 2003/08/12, 2003/08/13, 2003/08/15, 2003/08/18, 2003/08/19, 2003/08/20, 2003/08/21, 2003/08/22, 2003/08/26, 2003/08/27, 2003/08/28, 2003/08/29, 2003/09/02, 2003/09/03, 2003/09/04, 2003/09/08, 2003/09/09, 2003/09/10, 2003/09/12, 2003/09/15, 2003/09/16, 2003/09/17, 2003/09/19, 2003/09/22, 2003/09/23, 2003/09/24, 2003/09/25, 2003/09/26, 2003/09/29, 2003/09/30, 2003/10/01, 2003/10/02, 2003/10/08, 2003/10/13, 2003/10/14, 2003/10/15, 2003/10/16, 2003/10/17, 2003/10/19, 2003/10/20, 2003/10/21, 2003/10/22, 2003/10/23, 2003/10/29, 2003/10/30, 2003/10/31, 2003/11/03, 2003/11/04, 2003/11/05, 2003/11/06, 2003/11/07, 2003/11/10, 2003/11/11, 2003/11/12, 2003/11/17, 2003/11/18, 2003/11/19, 2003/11/20, 2003/11/24, 2003/11/25, 2003/11/26, 2003/12/01, 2003/12/02, 2003/12/03, 2003/12/04, 2003/12/05, 2003/12/08, 2003/12/09, 2003/12/10, 2003/12/11, 2003/12/14, 2003/12/15, 2003/12/17, 2003/12/18, 2003/12/19, 2003/12/20, 2003/12/21, 2003/12/22, 2003/12/23, 2003/12/24, 2003/12/25, 2003/12/26, 2003/12/28, 2003/12/30, 2003/12/31, 2004/01/02, 2004/01/03, 2004/01/07, 2004/01/09, 2004/01/12, 2004/01/14, 2004/01/16, 2004/01/19, 2004/01/20, 2004/01/21, 2004/01/23, 2004/01/26, 2004/01/27, 2004/01/28, 2004/01/29, 2004/02/01, 2004/02/02, 2004/02/03, 2004/02/04, 2004/02/06, 2004/02/09, 2004/02/10, 2004/02/11, 2004/02/12, 2004/02/16, 2004/02/18, 2004/02/20, 2004/02/23, 2004/02/24, 2004/02/25, 2004/02/26, 2004/02/27, 2004/03/01, 2004/03/02, 2004/03/03, 2004/03/04, 2004/03/05, 2004/03/08, 2004/03/09, 2004/03/10, 2004/03/11, 2004/03/14, 2004/03/16, 2004/03/17, 2004/03/22, 2004/03/23, 2004/03/24, 2004/03/25, 2004/03/26, 2004/03/28, 2004/03/29, 2004/03/30, 2004/03/31, 2004/04/01, 2004/04/02, 2004/04/03, 2004/04/05, 2004/04/09, 2004/04/12, 2004/04/13, 2004/04/14, 2004/04/16, 2004/04/19, 2004/04/20, 2004/04/21, 2004/04/22, 2004/04/23, 2004/04/26, 2004/04/27, 2004/04/28, 2004/04/29, 2004/04/30, 2004/05/03, 2004/05/04, 2004/05/05, 2004/05/07, 2004/05/11, 2004/05/12, 2004/05/13, 2004/05/14, 2004/05/17, 2004/05/18, 2004/05/20, 2004/05/24, 2004/05/25, 2004/05/26, 2004/05/27, 2004/05/28, 2004/05/29, 2004/05/30, 2004/05/31, 2004/06/02, 2004/06/03, 2004/06/07, 2004/06/08, 2004/06/09, 2004/06/10, 2004/06/11, 2004/06/13, 2004/06/14, 2004/06/15, 2004/06/16, 2004/06/20, 2004/06/21, 2004/06/22, 2004/06/23, 2004/06/24, 2004/06/25, 2004/06/27, 2004/06/28, 2004/06/29, 2004/06/30, 2004/07/06, 2004/07/07, 2004/07/08, 2004/07/09, 2004/07/11, 2004/07/12, 2004/07/14, 2004/07/15, 2004/07/17, 2004/07/21, 2004/07/28, 2004/07/31, 2004/08/02, 2004/08/03, 2004/08/04, 2004/08/05, 2004/08/09, 2004/08/10, 2004/08/12, 2004/08/13, 2004/08/17, 2004/08/19, 2004/08/20, 2004/08/21, 2004/08/22, 2004/08/23, 2004/08/24, 2004/08/25, 2004/08/26, 2004/08/27, 2004/08/28, 2004/08/29, 2004/08/30, 2004/08/31, 2004/09/01, 
2004/09/02, 2004/09/09, 2004/09/10, 2004/09/11, 2004/09/12, 2004/09/13, 2004/09/14, 2004/09/15, 2004/09/16, 2004/09/17, 2004/09/18, 2004/09/19, 2004/09/20, 2004/09/21, 2004/09/22, 2004/09/23, 2004/09/24, 2004/09/27, 2004/09/28, 2004/09/29, 2004/09/30, 2004/10/01, 2004/10/02, 2004/10/03, 2004/10/04, 2004/10/05, 2004/10/06, 2004/10/07, 2004/10/08, 2004/10/09, 2004/10/10, 2004/10/11, 2004/10/12, 2004/10/13, 2004/10/14, 2004/10/15, 2004/10/16, 2004/10/17, 2004/10/18, 2004/10/19, 2004/10/20, 2004/10/21, 2004/10/22, 2004/10/23, 2004/10/24, 2004/10/25, 2004/10/26, 2004/10/27, 2004/10/28, 2004/10/29, 2004/10/30, 2004/10/31, 2004/11/02, 2004/11/03, 2004/11/04, 2004/11/05, 2004/11/06, 2004/11/07, 2004/11/08, 2004/11/09, 2004/11/10, 2004/11/11, 2004/11/12, 2004/11/13, 2004/11/16, 2004/11/17, 2004/11/18, 2004/11/19, 2004/11/20, 2004/11/21, 2004/11/22, 2004/11/23,......] - 6040 partitions
   
   
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Using cleanerParallelism: 200
   20/04/22 20:24:33 INFO SparkContext: Starting job: collect at HoodieCopyOnWriteTable.java:396
   20/04/22 20:24:33 INFO DAGScheduler: Registering RDD 58 (repartition at HoodieCopyOnWriteTable.java:392)
   20/04/22 20:24:33 INFO DAGScheduler: Registering RDD 62 (mapPartitionsToPair at HoodieCopyOnWriteTable.java:393)
   20/04/22 20:24:33 INFO DAGScheduler: Got job 8 (collect at HoodieCopyOnWriteTable.java:396) with 200 output partitions
   20/04/22 20:24:33 INFO DAGScheduler: Final stage: ResultStage 29 (collect at HoodieCopyOnWriteTable.java:396)
   20/04/22 20:24:33 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 28)
   20/04/22 20:24:33 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 28)
   20/04/22 20:24:33 INFO DAGScheduler: Submitting ShuffleMapStage 27 (MapPartitionsRDD[58] at repartition at HoodieCopyOnWriteTable.java:392), which has no missing parents
   20/04/22 20:24:33 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 174.2 KB, free 1008.1 MB)
   20/04/22 20:24:33 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 60.8 KB, free 1008.0 MB)
   20/04/22 20:24:33 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on ip (size: 60.8 KB, free: 1008.7 MB)
   20/04/22 20:24:33 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:1201
   20/04/22 20:24:33 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 27 (MapPartitionsRDD[58] at repartition at HoodieCopyOnWriteTable.java:392) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   
   
   20/04/22 20:24:35 INFO DAGScheduler: ShuffleMapStage 27 (repartition at HoodieCopyOnWriteTable.java:392) finished in 1.401 s
   20/04/22 20:24:35 INFO DAGScheduler: looking for newly runnable stages
   20/04/22 20:24:35 INFO DAGScheduler: running: Set()
   20/04/22 20:24:35 INFO DAGScheduler: waiting: Set(ShuffleMapStage 28, ResultStage 29)
   20/04/22 20:24:35 INFO DAGScheduler: failed: Set()
   20/04/22 20:24:35 INFO DAGScheduler: Submitting ShuffleMapStage 28 (MapPartitionsRDD[62] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393), which has no missing parents
   20/04/22 20:24:35 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 174.8 KB, free 1007.9 MB)
   20/04/22 20:24:35 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 61.5 KB, free 1007.8 MB)
   20/04/22 20:24:35 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on ip (size: 61.5 KB, free: 1008.7 MB)
   20/04/22 20:24:35 INFO SparkContext: Created broadcast 18 from broadcast at DAGScheduler.scala:1201
   20/04/22 20:24:35 INFO DAGScheduler: Submitting 200 missing tasks from ShuffleMapStage 28 (MapPartitionsRDD[62] at mapPartitionsToPair at HoodieCopyOnWriteTable.java:393) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   
   
   
   20/04/22 20:24:35 INFO DAGScheduler: ResultStage 29 (collect at HoodieCopyOnWriteTable.java:396) finished in 0.057 s
   20/04/22 20:24:35 INFO DAGScheduler: Job 8 finished: collect at HoodieCopyOnWriteTable.java:396, took 1.627846 s
   20/04/22 20:24:35 INFO FileSystemViewManager: Creating InMemory based view for basePath <storageLocation>
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:35 INFO FSUtils: Hadoop Configuration: fs.defaultFS: hdfs-ip, Config:[Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, emrfs-site.xml, __spark_hadoop_conf__.xml, file:/etc/spark/conf.dist/hive-site.xml], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   20/04/22 20:24:35 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>.hoodie/hoodie.properties
   20/04/22 20:24:35 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:24:35 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:24:35 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:24:35 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@1b7440c3
   20/04/22 20:24:35 INFO HoodieWriteClient: Cleaned 0 files
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Marking instant complete [==>20200422201250__clean__INFLIGHT]
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 0 bytes
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200422201250.clean.inflight. Elapsed seconds: 0.
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Created a new file in meta path: <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200422201250.clean.inflight
   20/04/22 20:24:36 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 272855 bytes
   20/04/22 20:24:36 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200422201250.clean.inflight. Elapsed seconds: 0.
   20/04/22 20:24:36 INFO S3NativeFileSystem: rename <storageLocation>.hoodie/20200422201250.clean.inflight <storageLocation>.hoodie/20200422201250.clean
   20/04/22 20:24:36 INFO HoodieActiveTimeline: Completed [==>20200422201250__clean__INFLIGHT]
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 successful!
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data source with options :Map(hoodie.datasource.write.insert.drop.duplicates -> false, hoodie.datasource.hive_sync.database -> default, hoodie.insert.shuffle.parallelism -> 200, path -> <storageLocation>, hoodie.datasource.write.precombine.field -> dedupeKey, hoodie.datasource.hive_sync.partition_fields -> , hoodie.datasource.write.payload.class -> org.apache.hudi.OverwriteWithLatestAvroPayload, hoodie.datasource.hive_sync.partition_extractor_class -> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor, hoodie.datasource.write.streaming.retry.interval.ms -> 2000, hoodie.datasource.hive_sync.table -> unknown, hoodie.datasource.write.streaming.ignore.failed.batch -> true, hoodie.datasource.write.operation -> upsert, hoodie.datasource.hive_sync.enable -> false, hoodie.datasource.write.recordkey.field -> customer_trx_id, hoodie.datasource.view.type -> read_optimized, hoodie.table.name -> ra_customer_trx_all, hoodie.datasource.hive_sync.jdbcurl -> jdbc:hive2://localhost:10000, hoodie.datasource.write.storage.type -> COPY_ON_WRITE, hoodie.datasource.hive_sync.username -> hive, hoodie.datasource.write.streaming.retry.count -> 3, hoodie.datasource.hive_sync.password -> hive, hoodie.datasource.write.keygenerator.class -> org.apache.hudi.SimpleKeyGenerator, hoodie.upsert.shuffle.parallelism -> 200, hoodie.datasource.hive_sync.assume_date_partitioning -> false, hoodie.datasource.write.partitionpath.field -> partition_key, hoodie.datasource.write.commitmeta.key.prefix -> _, hoodie.bulkinsert.shuffle.parallelism -> 200)
   
   
   
   Finally, it logs loading the hoodie table for each of the 6040 partitions,
   something similar to:
   
   
   20/04/22 21:06:20 INFO HoodieTableConfig: Loading dataset properties from <storageLocation>/.hoodie/hoodie.properties
   20/04/22 21:06:20 INFO S3NativeFileSystem: Opening '<storageLocation>/.hoodie/hoodie.properties' for reading
   20/04/22 21:06:20 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/.hoodie/hoodie.properties' in bucket 'delta-data-devo'. Returning object without decryption.
   20/04/22 21:06:20 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 21:06:20 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@b9cfa18
   20/04/22 21:06:20 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/04/20, #FileGroups=1
   20/04/22 21:06:20 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=3, FileGroupsCreationTime=160, StoreTimeTaken=0
   20/04/22 21:06:20 INFO HoodieROTablePathFilter: Based on hoodie metadata from base path: <storageLocation>, caching 1 files under <storageLocation>/2020/04/20
   20/04/22 21:06:25 INFO SparkContext: Invoking stop() from shutdown hook
   20/04/22 21:06:25 INFO SparkUI: Stopped Spark web UI at ip
   20/04/22 21:06:25 INFO YarnClientSchedulerBackend: Interrupting monitor thread
   20/04/22 21:06:25 INFO YarnClientSchedulerBackend: Shutting down all executors
   20/04/22 21:06:25 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   20/04/22 21:06:25 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
   
   It runs till 20/04/22 21:06:25 
   ```
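   
   Side note: most of the wall-clock time above goes into the cleaner walking all 6040 partitions, so tuning the cleaner may shrink that window. A minimal sketch, assuming the Hudi 0.5.x option names and a `hudiOptions` map / `inputDF` DataFrame / `load_path` like the ones used elsewhere in this thread; the values are illustrative, not tested recommendations:
   ```
   val cleanerTuning = Map(
     // retain fewer commits so the cleaner has less history to reason about
     "hoodie.cleaner.commits.retained" -> "10",
     // parallelism the cleaner uses when scanning/deleting files
     "hoodie.cleaner.parallelism" -> "200",
     // or disable inline cleaning entirely and run it out-of-band
     "hoodie.clean.automatic" -> "false"
   )
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions ++ cleanerTuning).
     mode("Append").
     save(load_path)
   ```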


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619306738


   ## Spark log analysis brainstorming 
   
   hi @vinothchandar, here is a preliminary analysis of this issue
   (based on `release-0.5.0` branch https://github.com/apache/incubator-hudi/tree/release-0.5.0)
   
   ### Simplified origin spark log
   
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: "spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/22 20:12:14 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<JARPATH>' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:12:14 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
   20/04/22 20:12:14 INFO SparkContext: Running Spark version 2.4.4
   20/04/22 20:12:14 INFO SparkContext: Submitted application: HUDI incremental data loading
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing modify acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls groups to: 
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/.hoodie/hoodie.properties' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21,
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 successful!
   
   
   // CreateRelation(sqlContext, parameters, df.schema)
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data source with options ...
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:38 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/10/.hoodie_partition_metadata' for reading
   20/04/22 20:24:38 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/10/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/17/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/17/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:39 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:39 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/18/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/18/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   // Finish part
   20/04/22 21:06:25 INFO SparkContext: Successfully stopped SparkContext
   20/04/22 21:06:25 INFO ShutdownHookManager: Shutdown hook called
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c24e90ad-7be5-4a04-b4f1-2575cf68bd5a
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c18b2f4f-e525-43de-b896-75244fe63591
   ```
   
   #### Analysis
   
   ![image](https://user-images.githubusercontent.com/20113411/80269021-c7225880-86de-11ea-84fe-464d08fc4e9b.png)
   
   
   ### Analysis about step 2
   
   Loading about 6000 partitions cost 9 mins.
   
   1. Call stacktrace
   ```
   "main" Id=1 cpuUsage=94% RUNNABLE
       ...
       at org.apache.hudi.common.util.FSUtils.processFiles(FSUtils.java:227)
       at org.apache.hudi.common.util.FSUtils.getAllFoldersWithPartitionMetaFile(FSUtils.java:191)
       at org.apache.hudi.common.util.FSUtils.getAllPartitionPaths(FSUtils.java)
       at org.apache.hudi.table.HoodieCopyOnWriteTable.clean(HoodieCopyOnWriteTable.java:285)
       at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:950)
   ```
   2. IMO, `Step 2` is affected by the NameNode's RPC performance; the sketch below shows why the listing is RPC-bound.
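   
   For intuition on item 1, `getAllPartitionPaths` boils down to a recursive directory listing that looks for `.hoodie_partition_metadata` markers under the base path, so its cost grows with the partition count times the per-directory RPC latency. A rough sketch of that shape (not Hudi's actual code, just the plain Hadoop `FileSystem` API):
   ```
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   // One listStatus RPC per directory, so ~6000 partitions means
   // thousands of sequential round trips to the NameNode / S3.
   def listPartitions(fs: FileSystem, dir: Path): Seq[Path] = {
     val children = fs.listStatus(dir)
     if (children.exists(_.getPath.getName == ".hoodie_partition_metadata")) Seq(dir)
     else children
       .filter(s => s.isDirectory && !s.getPath.getName.startsWith("."))
       .flatMap(s => listPartitions(fs, s.getPath))
       .toSeq
   }
   
   val base = new Path("<storageLocation>") // placeholder base path
   val partitions = listPartitions(base.getFileSystem(new Configuration()), base)
   ```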
   
   ### Analysis about step 4
   When I set the `WARN` log level, the clean operation runs faster. Is the performance loss due to the interaction between the driver and executors?
   
   ![image](https://user-images.githubusercontent.com/20113411/80269122-75c69900-86df-11ea-9a34-b805b694f9e9.png)
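   
   If log traffic really is the bottleneck, the quickest way to test that theory is to drop the log level programmatically before the write (standard Spark API; executors still follow their own log4j config, so a full test would need their log4j.properties set to WARN as well):
   ```
   // Only changes the log level of the driver JVM at runtime.
   spark.sparkContext.setLogLevel("WARN")
   ```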
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618070975


   hi @lamber-ken, I am already setting parallelism to 200:
   val hudiOptions = Map(HoodieWriteConfig.TABLE_NAME -> "table_name",
             DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
             DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
             DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "dedupeKey",
             DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition_key", // date in yyyy/mm/dd format
             "hoodie.upsert.shuffle.parallelism" -> "200",
             DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
   inputDF.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save(load_path)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] harshi2506 edited a comment on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
harshi2506 edited a comment on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618068101


   @vinothchandar yes, it is happening every time; I tried it 3 times and it's always the same. I see a hoodie commit being added within at most 5-6 minutes, and then it takes almost an hour to complete the EMR step. I am using AWS EMR, Hudi version 0.5.0-incubating.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619308174


   > @harshi2506 I am suspecting this may be due to a recent bug we fixed on master (still not 100%). Are you open to building hudi off master branch and giving that a shot? I am suspecting #1394
   
   IMO, this solution will help you; please try building hudi from the master branch. @harshi2506 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620246831


   ## The latest spark log (master branch)
   
   ### Simplified origin spark log
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: "spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/26 17:35:58 WARN S3CryptoModuleAE: Unable to detect encryption information for object 'jar' in bucket '<storageBucket>'. Returning object without decryption.
   
   // Start rollback
   20/04/26 17:36:34 INFO HoodieWriteClient: Begin rollback of instant 20200426170940
   20/04/26 17:36:34 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/26 17:36:34 INFO FSUtils: Hadoop Configuration: fs.defaultFS: [hdfs://ip-10-0-55-99.ec2.internal:8020]
   
   // Output repeatly 
   20/04/26 17:36:40 INFO FileStatusExt: Metadata Entry doesn't exist
   ...
   20/04/26 18:19:07 INFO FileStatusExt: Metadata Entry doesn't exist
   
   // Finish
   20/04/26 18:19:08 INFO CleanActionExecutor: Total Partitions to clean : 6052, with policy KEEP_LATEST_COMMITS
   20/04/26 18:19:08 INFO CleanActionExecutor: Using cleanerParallelism: 6000
   20/04/26 18:19:08 INFO SparkContext: Starting job: collect at CleanActionExecutor.java:87
   20/04/26 18:19:08 INFO DAGScheduler: Got job 10 (collect at CleanActionExecutor.java:87) with 6000 output partitions
   20/04/26 18:19:08 INFO DAGScheduler: Final stage: ResultStage 30 (collect at CleanActionExecutor.java:87)
   20/04/26 18:19:08 INFO DAGScheduler: Parents of final stage: List()
   20/04/26 18:19:08 INFO DAGScheduler: Missing parents: List()
   20/04/26 18:19:08 INFO DAGScheduler: Submitting ResultStage 30 (MapPartitionsRDD[63] at map at CleanActionExecutor.java:86), which has no missing parents
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 247.7 KB, free 1008.3 MB)
   20/04/26 18:19:08 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 87.1 KB, free 1008.2 MB)
   20/04/26 18:19:08 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory on ip-10-0-55-99.ec2.internal:44415 (size: 87.1 KB, free: 1008.8 MB)
   20/04/26 18:19:08 INFO SparkContext: Created broadcast 20 from broadcast at DAGScheduler.scala:1201
   20/04/26 18:19:08 INFO DAGScheduler: Submitting 6000 missing tasks from ResultStage 30 (MapPartitionsRDD[63] at map at CleanActionExecutor.java:86) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
   20/04/26 18:19:08 INFO YarnScheduler: Adding task set 30.0 with 6000 tasks
   20/04/26 18:19:23 INFO YarnScheduler: Removed TaskSet 32.0, whose tasks have all completed, from pool 
   20/04/26 18:19:23 INFO DAGScheduler: ResultStage 32 (collect at CleanActionExecutor.java:155) finished in 3.444 s
   20/04/26 18:19:23 INFO DAGScheduler: Job 11 finished: collect at CleanActionExecutor.java:155, took 8.313284 s
   20/04/26 18:19:23 INFO HoodieActiveTimeline: Loaded instants [[20200425201231__clean__COMPLETED], [20200425201231__commit__COMPLETED], [==>20200426173625__clean__INFLIGHT], [20200426173625__commit__COMPLETED]]
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Checking for file exists ?<storageLocation>.hoodie/20200426173625.clean.inflight
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: close closed:false <storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 273450 bytes
   20/04/26 18:19:24 INFO CSEMultipartUploadOutputStream: Finished uploading <storageLocation>.hoodie/20200426173625.clean. Elapsed seconds: 0.
   20/04/26 18:19:24 INFO HoodieActiveTimeline: Create new file for toInstant ?<storageLocation>.hoodie/20200426173625.clean
   20/04/26 18:19:24 INFO CleanActionExecutor: Marked clean started on 20200426173625 as complete
   20/04/26 18:19:24 INFO AbstractHoodieWriteClient: Committed 20200426173625
   20/04/26 18:19:24 INFO HoodieSparkSqlWriter$: Commit 20200426173625 successful!
   20/04/26 18:19:28 INFO SparkContext: Invoking stop() from shutdown hook
   20/04/26 18:19:28 INFO SparkUI: Stopped Spark web UI at ip
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Interrupting monitor thread
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Shutting down all executors
   20/04/26 18:19:28 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
   20/04/26 18:19:28 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
   (serviceOption=None,
    services=List(),
    started=false)
   20/04/26 18:19:28 INFO YarnClientSchedulerBackend: Stopped
   20/04/26 18:19:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
   20/04/26 18:19:28 INFO MemoryStore: MemoryStore cleared
   20/04/26 18:19:28 INFO BlockManager: BlockManager stopped
   20/04/26 18:19:28 INFO BlockManagerMaster: BlockManagerMaster stopped
   20/04/26 18:19:28 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
   20/04/26 18:19:28 INFO SparkContext: Successfully stopped SparkContext
   20/04/26 18:19:28 INFO ShutdownHookManager: Shutdown hook called
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-e12c26c7-150a-4837-8fb2-72db585dd2f3
   20/04/26 18:19:28 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-1a4b587a-a177-4cab-8f93-6849d64386d9
   Command exiting with ret '0'
   
   ```
   
   ### Simplified original Spark log
   
   ```
   // Upsert part
   Warning: Ignoring non-spark config property: "spark.sql.hive.convertMetastoreParquet=false"
   Warning: Ignoring non-spark config property: "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   20/04/22 20:12:14 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<JARPATH>' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:12:14 WARN HiveConf: HiveConf of name hive.server2.thrift.url does not exist
   20/04/22 20:12:14 INFO SparkContext: Running Spark version 2.4.4
   20/04/22 20:12:14 INFO SparkContext: Submitted application: HUDI incremental data loading
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing modify acls to: hadoop
   20/04/22 20:12:15 INFO SecurityManager: Changing view acls groups to: 
   20/04/22 20:15:31 INFO S3NativeFileSystem: Opening '<storageLocation>.hoodie/hoodie.properties' for reading
   20/04/22 20:15:31 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/.hoodie/hoodie.properties' in bucket '<storageBucket>'. Returning object without decryption.
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE from <storageLocation>
   20/04/22 20:15:31 INFO HoodieTableMetaClient: Loading Active commit timeline for <storageLocation>
   20/04/22 20:15:31 INFO HoodieActiveTimeline: Loaded instants java.util.stream.ReferencePipeline$Head@678852b5
   20/04/22 20:24:33 INFO HoodieCopyOnWriteTable: Partitions to clean up : [2001/05/02, 2001/05/07, 2001/05/09, 2001/05/10, 2001/05/17, 2001/05/18, 2001/05/21, 2001/06/01, 2001/06/04, 2001/06/08, 2001/06/20, 2001/06/21, 2001/07/17, 2001/07/23, 2001/07/25, 2001/07/30, 2001/08/02, 2001/08/03, 2001/08/07, 2001/08/08, 2001/08/09, 2001/08/14, 2001/08/23, 2001/09/05, 2001/09/06, 2001/09/07, 2001/09/13, 2001/09/14, 2001/10/02, 2001/10/03, 2001/10/04, 2001/10/09, 2001/11/01, 2001/11/09, 2001/11/14, 2001/11/15, 2001/11/16, 2001/11/19, 2001/11/20, 2001/11/21,
   20/04/22 20:24:36 INFO HoodieWriteClient: Marked clean started on 20200422201250 as complete
   20/04/22 20:24:36 INFO HoodieWriteClient: Committed 20200422201250
   20/04/22 20:24:36 INFO HoodieSparkSqlWriter$: Commit 20200422201250 successful!
   
   
   // CreateRelation(sqlContext, parameters, df.schema)
   20/04/22 20:24:36 INFO DefaultSource: Constructing hoodie (as parquet) data source with options ...
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:38 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/10/.hoodie_partition_metadata' for reading
   20/04/22 20:24:38 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/10/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:38 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:38 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/17/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/17/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   20/04/22 20:24:39 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from <storageLocation>
   20/04/22 20:24:39 INFO FSUtils: Hadoop Configuration: fs.defaultFS:...], FileSystem: [com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@3c0cfaad]
   ...
   20/04/22 20:24:39 INFO S3NativeFileSystem: Opening '<storageLocation>2001/05/18/.hoodie_partition_metadata' for reading
   20/04/22 20:24:39 WARN S3CryptoModuleAE: Unable to detect encryption information for object '<storageLocation>/2001/05/18/.hoodie_partition_metadata' in bucket '<storageBucket>'. Returning object without decryption.
   
   // Finish part
   20/04/22 21:06:25 INFO SparkContext: Successfully stopped SparkContext
   20/04/22 21:06:25 INFO ShutdownHookManager: Shutdown hook called
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c24e90ad-7be5-4a04-b4f1-2575cf68bd5a
   20/04/22 21:06:25 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-c18b2f4f-e525-43de-b896-75244fe63591
   ```
   
   #### Analysis
   
   ![image](https://user-images.githubusercontent.com/20113411/80269021-c7225880-86de-11ea-84fe-464d08fc4e9b.png)
   ![image](https://user-images.githubusercontent.com/20113411/80422808-6b8be100-8911-11ea-9dd1-bfa6b0bafd9c.png)
   
   ### Analysis of step 2
   
   The `Rollback` step cost 42 minutes, repeatedly logging `FileStatusExt: Metadata Entry doesn't exist`.
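   
   A quick way to bound how much of that window went to the repeated listing checks is to count the message in the driver log (a minimal sketch; the log file name is an assumption):
   
   ```
   # Count the repeated EMRFS consistency messages emitted during the rollback
   grep -c "Metadata Entry doesn't exist" spark-driver.log
   ```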





[GitHub] [incubator-hudi] vinothchandar commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618180277


   @harshi2506 I suspect this may be due to a recent bug we fixed on master (still not 100% sure). Are you open to building Hudi off the master branch and giving that a shot? I suspect #1394 





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-623389443


   > fighting
   
   
   
   > #### BUG status
   > bug | status | way
   > -- | -- | --
   > upsert long time first time | fixed | upgrade version (0.5.0 to master)
   > FileStatusExt: Metadata Entry doesn't exist | fixed | using a whole new base path
   > java.io.FileNotFoundException: No such file or directory | fixed | but don't know the root cause
   > 17 minutes lag between HoodieActiveTimeline and CleanActionExecutor | fighting | xxx
   
   Hi @lamber-ken, a few changes to the bug status report:
   
   bug | status | way
   -- | -- | --
   upsert long time first time | fixed | upgrade version (0.5.0 to master)
   FileStatusExt: Metadata Entry doesn't exist | fixed | by enabling EMRFS consistent view on the EMR cluster or by using the Hudi file consistency config
   java.io.FileNotFoundException: No such file or directory | fixed | by enabling EMRFS consistent view on the EMR cluster or by using the Hudi file consistency config ("hoodie.consistency.check.enabled" -> "true")
   17 minutes lag between HoodieActiveTimeline and CleanActionExecutor | fighting | xxx
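   
   A minimal sketch of the consistency workaround named above, reusing the `hudiOptions`/`inputDF` style from the examples elsewhere in this thread:
   
   ```
   // Hedged sketch: add Hudi's consistency check so writes wait for S3
   // listings to become visible instead of failing on not-yet-visible files.
   val consistentOptions = hudiOptions + ("hoodie.consistency.check.enabled" -> "true")
   
   inputDF.write.format("org.apache.hudi").
     options(consistentOptions).
     mode("Append").
     save(basePath)
   ```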
   
   





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617976874


   hi @harshi2506, based on the above analysis, please try to increase the upsert parallelism and spark executor instances, for example
   ```
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --num-executors 10 \
     --executor-cores 5 \
     --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "200",
     "hoodie.datasource.write.recordkey.field" -> "id",
     "hoodie.datasource.write.partitionpath.field" -> "key", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp",
     "hoodie.memory.merge.max.size" -> "2004857600000"
   )
   
   val inputDF = spark.range(1, 300).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", current_timestamp()).
      withColumn("dt", date_format($"timestamp", "yyyy-MM-dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*").show();
   ```
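   
   A note on the knobs above: `hoodie.upsert.shuffle.parallelism` sets how many Spark tasks the upsert shuffle stage runs with. As general tuning guidance (not something measured in this thread), sizing it at a small multiple of the total executor cores (here 10 executors x 5 cores = 50) keeps individual tasks small without overwhelming the scheduler.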





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-618683897


   hi @harshi2506, we need more of the Spark INFO log; you can put the log file on Google Drive,
   e.g. https://drive.google.com/file/d/1zzyaySDJqPgAdTSLnKwOG667QGvZhd03





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-617906795


   Hi @lamber-ken, thanks for the response; attaching the screenshots:
   <img width="1431" alt="Screenshot 2020-04-22 at 10 31 32 PM" src="https://user-images.githubusercontent.com/64137937/80011182-24ed4f80-84e9-11ea-8a44-938a8d352b6f.png">
   <img width="1418" alt="Screenshot 2020-04-22 at 10 31 43 PM" src="https://user-images.githubusercontent.com/64137937/80011201-29b20380-84e9-11ea-86af-f526ff2d44d7.png">
   
   





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619449248


   hi @harshi2506, here are the build steps:
   **1. Build Env**
   - JDK8
   - Unix
   
   **2. Commands**
   ```
   git clone https://github.com/apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
   
   ```
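   
   After the build, the bundle jar should exist under `packaging/hudi-spark-bundle/target/` (the same path the `--jars` glob below assumes); a quick `ls` there confirms the build produced a `hudi-spark-bundle_*-SNAPSHOT.jar`.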
   
   **3. Run env**
   - Spark-2.4.4+
   - avro-1.8.0
   ```
   // run in local env
   export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --driver-memory 6G \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // run in yarn env
   export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --executor-memory 6G \
     --num-executors 5 \
     --executor-cores 5 \
     --queue root.default \
     --packages org.apache.spark:spark-avro_2.11:2.4.4 \
     --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   // scripts
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "file:///tmp/hudi_mor_tablen"
   // val basePath = "hdfs:///hudi/test"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.datasource.write.recordkey.field" -> "key",
     "hoodie.datasource.write.partitionpath.field" -> "dt", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 7).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", unix_timestamp()).
      withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
      withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Overwrite").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").show();
   ```
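   
   One detail worth noting in the script above: the partition path field `dt` is formatted as `yyyy/MM/dd`, so each partition is three directory levels deep, which is why the read uses `load(basePath + "/*/*/*")` with three wildcards, while the single-level `key`-partitioned example earlier in this thread reads with two.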





[GitHub] [incubator-hudi] lamber-ken commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
lamber-ken commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-619278766


   @harshi2506, I tried to reproduce your steps and may have found the answer:
   
   - Generate about 7000 partitions
   - Upsert 300 partitions
   
   From the following demo code:
   
   - `sc.setLogLevel("INFO")`: the upsert cost about 15 min
   - `sc.setLogLevel("WARN")`: the upsert cost about 3 min
   
   ```
   export HADOOP_HOME=/BigData/install/hadoop/hadoop-2.7.7-streamjar
   export HADOOP_MAPRED_HOME=$HADOOP_HOME
   export HADOOP_COMMON_HOME=$HADOOP_HOME
   export HADOOP_HDFS_HOME=$HADOOP_HOME
   export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
   export YARN_HOME=$HADOOP_HOME
   export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
   export HADOOP_USER_NAME=dcadmin
   
   export SPARK_HOME=/BigData/install/spark-2.4.4-bin-hadoop2.7
   ${SPARK_HOME}/bin/spark-shell \
     --master yarn \
     --driver-memory 6G \
     --executor-memory 6G \
     --num-executors 5 \
     --executor-cores 5 \
     --queue root.default \
     --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   
   import org.apache.spark.sql.functions._
   
   val tableName = "hudi_mor_table"
   val basePath = "viewfs://dcfs/ns-log/wxdprtc/flink/hudi"
   
   val hudiOptions = Map[String,String](
     "hoodie.upsert.shuffle.parallelism" -> "10",
     "hoodie.datasource.write.recordkey.field" -> "key",
     "hoodie.datasource.write.partitionpath.field" -> "dt", 
     "hoodie.table.name" -> tableName,
     "hoodie.datasource.write.precombine.field" -> "timestamp"
   )
   
   val inputDF = spark.range(1, 7000).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", unix_timestamp()).
      withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
      withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Overwrite").
     save(basePath)
   
   spark.read.format("org.apache.hudi").load(basePath + "/*/*/*").count();
   
   
   // cost long time, about 15min
   sc.setLogLevel("INFO")
   
   val inputDF = spark.range(1, 300).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", unix_timestamp()).
      withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
      withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   
   
   // cost less time, about 3min
   sc.setLogLevel("WARN")
   
   val inputDF = spark.range(1, 300).
      withColumn("key", $"id").
      withColumn("data", lit("data")).
      withColumn("timestamp", unix_timestamp()).
      withColumn("dtstamp", unix_timestamp() + ($"id" * 24 * 3600)).
      withColumn("dt", from_unixtime($"dtstamp", "yyyy/MM/dd"))
   
   inputDF.write.format("org.apache.hudi").
     options(hudiOptions).
     mode("Append").
     save(basePath)
   ```
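   
   The 5x gap suggests INFO-level logging itself accounts for a large share of the wall-clock time at this partition count. A minimal sketch for quieting logs at submit time instead of calling `sc.setLogLevel` interactively (the file path and name are assumptions):
   
   ```
   # Hedged sketch: write a log4j 1.x config (Spark 2.4 uses log4j 1.x) that
   # logs at WARN, then point the driver and executors at it.
   cat > /tmp/log4j-warn.properties <<'EOF'
   log4j.rootCategory=WARN, console
   log4j.appender.console=org.apache.log4j.ConsoleAppender
   log4j.appender.console.target=System.err
   log4j.appender.console.layout=org.apache.log4j.PatternLayout
   log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
   EOF
   
   # Remaining spark-shell flags as in the examples above; on YARN, ship the
   # file to executors with --files and reference it by bare file name instead.
   ${SPARK_HOME}/bin/spark-shell \
     --conf 'spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j-warn.properties' \
     --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/tmp/log4j-warn.properties' \
     ...
   ```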
   
   





[GitHub] [incubator-hudi] harshi2506 commented on issue #1552: Time taken for upserting hudi table is increasing with increase in number of partitions

Posted by GitBox <gi...@apache.org>.
harshi2506 commented on issue #1552:
URL: https://github.com/apache/incubator-hudi/issues/1552#issuecomment-620512620


   @vinothchandar, I tried building the jar from the master branch and loading a snapshot; it fails every time with:
   
   ```
   20/04/28 09:40:14 WARN TaskSetManager: Lost task 178.2 in stage 4.0 (TID 37246, ip-10-0-61-179.ec2.internal, executor 17): java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieInsertException: Failed to close the Insert Handle for path <storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet
   	at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
   	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
   	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
   	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
   	at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
   	at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
   	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
   	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
   	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
   	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
   	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
   	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:123)
   	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieInsertException: Failed to close the Insert Handle for path <storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet
   	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable.computeNext(CopyOnWriteLazyInsertIterable.java:110)
   	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable.computeNext(CopyOnWriteLazyInsertIterable.java:46)
   	at org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
   	... 23 more
   Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieInsertException: Failed to close the Insert Handle for path <storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
   	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable.computeNext(CopyOnWriteLazyInsertIterable.java:106)
   	... 25 more
   Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieInsertException: Failed to close the Insert Handle for path <storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet
   	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
   	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
   	... 26 more
   Caused by: org.apache.hudi.exception.HoodieInsertException: Failed to close the Insert Handle for path <storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet
   	at org.apache.hudi.io.HoodieCreateHandle.close(HoodieCreateHandle.java:183)
   	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:152)
   	at org.apache.hudi.execution.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:132)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
   	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
   	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   	... 3 more
   Caused by: java.io.FileNotFoundException: No such file or directory '<storageLocation>/2018/07/03/6452abed-658d-4840-92cf-1b7b970d2b61-30_178-4-37246_20200428084112.parquet'
   	at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:830)
   	at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.getFileStatus(EmrFileSystem.java:553)
   	at org.apache.hudi.common.fs.HoodieWrapperFileSystem.getFileStatus(HoodieWrapperFileSystem.java:304)
   	at org.apache.hudi.common.fs.FSUtils.getFileSize(FSUtils.java:143)
   	at org.apache.hudi.io.HoodieCreateHandle.close(HoodieCreateHandle.java:169)
   	... 8 more
   ```
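   
   (Per the status table earlier in this thread, this `FileNotFoundException` was eventually traced to S3 eventual consistency and fixed by enabling EMRFS consistent view or by setting "hoodie.consistency.check.enabled" -> "true".)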
   

