Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/06 06:54:51 UTC

[GitHub] [hudi] Ambarish-Giri opened a new issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Ambarish-Giri opened a new issue #3605:
URL: https://github.com/apache/hudi/issues/3605


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   Hi Team, 
   I was testing Hudi for inserts/updates/deletes on data in S3. Below are the benchmark metrics captured so far on varied data sizes: 
   
   Run 1 - Fresh Insert
   -----------------------
   Total Data size = 7 GB
   
   
   COW = 22 mins
   MOR = 25 mins
   
   
   
   Run 2 - Upsert
   --------------------
   Total Data Size=6.7 GB
   
   COW = 61 mins
   MOR = 64 mins
   
   
   Run 3 - Upsert
   -------------------
   Total Data size:  2.5 GB
   
   COW = 39 mins
   MOR = 53 mins
   
   Below are cluster configurations used:
   EMR Version : 5.33.0
   Hudi: 0.7.0
   Spark: 2.4.7
   Scala: 2.11.12
   Static cluster with 1 Master (m5.xlarge), 4 * (m5.2xlarge) core nodes and 4 * (m5.2xlarge) task nodes
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Execute a Hudi insert/upsert on text data stored in S3.
   2. The spark-submit is issued on EMR 5.33.0.
   3. Hudi 0.7.0 and Scala 2.11.12 are used.
   
   **Expected behavior**
   
   I was not expecting Hudi to take this much time to write to the Hudi store. The expectation was that both inserts and upserts should take at most 15-20 mins for data of this size (7-8 GB). Also, even for plain writes, the CoW write strategy performed better than MoR, which I expected to be the other way around.
   
   **Environment Description**
   
   * Hudi version : 0.7.0
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.7
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   
   **Additional context**
   This is a purely batch job; we receive daily loads, and upserts are supposed to be performed over the existing Hudi tables.
   
   Static EMR cluster: 1 Master (m5.xlarge) node, 4 * (m5.2xlarge) core nodes and 4 * (m5.2xlarge) task nodes
   Spark submit command ::
   spark-submit --master yarn --num-executors 8 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=5 \
        --conf spark.segment.etl.numexecutors=8 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=500 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=500 \
        --conf spark.kryo.registrationRequired=false \
        --class <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar  \
        s3://<jar-name>
   
   HUDI insert and upsert parameters:
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, if(hudiWriteStrg=="MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
         .option("hoodie.upsert.shuffle.parallelism", "2")
         .mode(SaveMode.Overwrite)
         .save(s"$basePath/$tableName/")
   
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, if(hudiWriteStrg=="MOR") DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL else DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .mode(SaveMode.Append)
         .save(s"$basePath/$tableName/")
   
   
   I have tried to run a full production load of 53 GB on the production cluster with the below cluster configuration and spark-submit command for a Hudi insert using the CoW write strategy. I observed that it takes more than 2 hrs just for the insert, and it is quite evident from the earlier runs that it will take even more time for the upsert operation.
   
   Total Data size: 53 GB
   Cluster Size: 1 Master (m5.xlarge) node, 2 * (r5a.24xlarge) core nodes and 6 * (r5a.24xlarge) task nodes
   Spark submit command ::
   spark-submit --master yarn --num-executors 192 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=4 \
        --conf spark.segment.etl.numexecutors=192 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=1536 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=1536 \
        --conf spark.kryo.registrationRequired=false \
        --class <class-name> \
         --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar  \
        s3://<jar-name>
    
   Hudi insert and upsert parameters are the same as above.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-915102513


   Hi Team, following up on the ticket to check if there is any update.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917361121


   Hi @danny0405, can you explain a bit more about "if the BloomFilter got a false positive"?
   In my case the record key is concat(uuid4, segmentId). SegmentId is an integer value, i.e. it can be the same for multiple records, and uuid4 is a standard unique random value, but the combination of both identifies a record uniquely. The partition key is again segmentId, as it has low cardinality.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-927503142


   Hi @nsivabalan, let me know in case you need any further details.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917646992


   Hi @nsivabalan ,
   
   Sure, I will try bulk-insert once and update.
   
   1# Upserts can be spread across partitions or be specific to a few, depending on the data received for that day, and a load can also contain just appends.
   2# No, the record key doesn't have any timestamp affinity. As mentioned, the record key is concat(segmentId, uuid4). SegmentId is an integer value, i.e. it can be the same for multiple records, and uuid4 is a standard unique random value (note: the "-" characters are removed from the uuid4 values though), but the combination of both identifies a record uniquely. The partition key is again segmentId, as it has low cardinality.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-936162315


   Hi @nsivabalan, I analysed the Hudi code as well to check if there is any room for improvement, but couldn't find much. Let me know if there are any updates from your end.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-920561496


   Hi @nsivabalan ,
   
   We have been trying to optimize the upsert, but the 44 GB upsert over a 54 GB bulk-insert on a fairly big cluster is still taking more than 3 hrs. Below are the EMR cluster configuration and the upsert config:
   
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
          .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP,HoodieIndex.IndexType.SIMPLE.toString())
         .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP,50)
         .option(HoodieMetadataConfig.METADATA_ENABLE_PROP, true)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, true)
         .option(HoodieWriteConfig.UPSERT_PARALLELISM, 200)
         .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT_PROP, false)
         .option(HoodieWriteConfig.WRITE_BUFFER_LIMIT_BYTES, 41943040)
         .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
         .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, true)
           .mode(SaveMode.Append)
         .save(s"$basePath/$tableName/")
   
   Cluster config:
   Static EMR cluster: 1 Master (m5.xlarge) node and 8 * (r5d.24xlarge) core nodes
   
   Spark-Submit Command ::
   
   spark-submit --master yarn --deploy-mode client \
        --num-executors 192 --driver-memory 4G --executor-memory 20G \
        --conf spark.yarn.executor.memoryOverhead=4096 \
        --conf spark.yarn.driver.memoryOverhead=2048 \
        --conf spark.yarn.max.executor.failures=100 \
        --conf spark.task.cpus=1 \
        --conf spark.rdd.compress=true \
        --conf spark.kryoserializer.buffer.max=512m \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.executor.cores=4 \
        --conf spark.segment.etl.numexecutors=192 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.sql.hive.convertMetastoreParquet=false \
   	 --conf spark.task.maxFailures=4 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=1536 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=0 \
        --conf spark.hadoop.fs.s3.maxRetries=20 \
        --conf spark.kryoserializer.buffer.max=512m \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=3000 \
        --class <class-name>\
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar  \
        s3://<application>.jar





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-916252147


   Hi @danny0405, as mentioned, my use case is purely batch. Is Flink Hudi for streaming or for batch?





[GitHub] [hudi] danny0405 commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-916578830


   Yeah, Spark is good for the batch case, but the Bloom index is not very stable when your updates are kind of random (with respect to the target partitions). If the BloomFilter gets a false positive, Hoodie has to scan the whole parquet file, which is the reason why it is slow.
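   
   For illustration only (not code from this thread): a minimal sketch of making the index choice explicit on the writer, reusing the Hudi 0.7.0 option constants and the placeholder names (`userSegDf`, `key`, `partitionKey`, `combineKey`, `tableName`, `basePath`) from the issue description. Whether disabling range pruning or switching to SIMPLE helps depends on the workload.
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
   import org.apache.hudi.index.HoodieIndex
   import org.apache.spark.sql.SaveMode
   
   userSegDf.write
     .format("hudi")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     // Random uuid-based keys defeat min/max range pruning on the Bloom index ...
     .option("hoodie.bloom.index.prune.by.ranges", "false")
     // ... or switch the index type entirely (SIMPLE is what this thread tries later).
     .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .mode(SaveMode.Append)
     .save(s"$basePath/$tableName/")
   ```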





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-923797306


   Hi @nsivabalan @danny0405, any updates on the above issue?





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917429823


   Hey, hi @Ambarish-Giri:
   For the initial bulk loading of data into Hudi, you can try the "bulk_insert" operation; it is expected to be faster than the regular operations. Ensure you set the right value for the [avg record size config](https://hudi.apache.org/docs/configurations/#hoodiecopyonwriterecordsizeestimate). For subsequent operations, Hudi will infer the record size from older commits, but for the first commit (bulk import/bulk_insert), Hudi relies on this config to pack records into right-sized files. A sketch of such a bulk_insert call follows the questions below.
   
   Couple of questions before we dive into perf in detail:
   1. May I know what your upsert characteristics are? Are upserts spread across all partitions, or just a few recent partitions?
   2. Does your record key have any timestamp affinity or characteristics? If record keys are completely random, we can try the SIMPLE index, since Bloom may not be very effective for completely random keys.
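   
   As a hedged illustration of the bulk_insert suggestion above (not code from this thread), using the placeholders from the issue description; the parallelism and record-size values are examples only:
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
   import org.apache.spark.sql.SaveMode
   
   userSegDf.write
     .format("hudi")
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
     // First commit: Hudi cannot infer record size from earlier commits, so estimate it here.
     .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
     // Shuffle parallelism used by bulk_insert (example value).
     .option("hoodie.bulkinsert.shuffle.parallelism", "500")
     .mode(SaveMode.Overwrite)
     .save(s"$basePath/$tableName/")
   ```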
   
   
   





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-937435741


   Sorry, what shuffle parallelism are you setting for these writes? In your original description, I see you are setting it to 2; that would definitely give you bad perf. Try something like 100 to 200 and see how it pans out (see the sketch below for the relevant keys).
   We have different configs for the different operations, so ensure you set the right one.
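   
   For reference (a sketch, not from this thread), these are the per-operation shuffle parallelism keys being referred to in Hudi 0.7.x; the values are example figures in the suggested range:
   
   ```scala
   // Pick the key that matches the write operation actually being issued.
   val parallelismOptions = Map(
     "hoodie.insert.shuffle.parallelism"     -> "200",
     "hoodie.upsert.shuffle.parallelism"     -> "200",
     "hoodie.bulkinsert.shuffle.parallelism" -> "200"
   )
   // e.g. userSegDf.write.format("hudi").options(parallelismOptions). ... .save(path)
   ```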





[GitHub] [hudi] nsivabalan edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-937435741


   Sorry, what shuffle parallelism are you setting for these writes? In your original description, I see you are setting it to 2; that would definitely give you bad perf. Try something in the range of 100 to 1000, depending on your data size, and see how it pans out.
   We have different configs for the different operations, so ensure you set the right one.





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-926578764


   Below are the Hudi Spark stages that are consuming the most time:
   BulkInsert (MoR):
   
   ![image](https://user-images.githubusercontent.com/85560823/134672503-66e0ea24-44d5-4103-aa18-4a08c2f25996.png)
   
   
   Upsert (MoR):
   
   ![image](https://user-images.githubusercontent.com/85560823/134672918-8ad358e8-4e53-431a-82f5-a5e30dc8cbf8.png)
   
   
   





[GitHub] [hudi] danny0405 commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
danny0405 commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-916146423


   You can try Flink Hudi instead; it has very good performance.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-926577602


   Hi @nsivabalan ,
   
   1# Correct, I was using {segmentId, uuid} as a ComplexKey record key since the combined key uniquely identifies records; given that partitioning is done on segmentId, it makes sense to have just uuid as the record key. I have taken care of the orthogonal issue you pointed out.
   
   2# Partitioning the data by segmentId seems appropriate, and its cardinality is not that low; for example, 50 GB of data will have nearly 3000 unique segments, and consecutive upserts will just add to that number, probably 1000 more for an upsert of equivalent data size.
   
   3# I am using the MoR write strategy.
   
   4# Below is my cluster configuration:
   1 * r5.2xlarge master node and 100 * r5.4xlarge core nodes
   
   5# spark submit command:
   
   `spark-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 12G --executor-memory 48G \
        --conf spark.yarn.executor.memoryOverhead=8192 \
   	 --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
   	 --conf spark.shuffle.io.numConnectionsPerPeer=3 \
   	 --conf spark.shuffle.file.buffer=512k \
   	 --conf spark.memory.fraction=0.7 \
   	 --conf spark.memory.storageFraction=0.5 \
   	 --conf spark.kryo.unsafe=true \
   	 --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
   	 --conf spark.hadoop.fs.s3a.connection.maximum=2000 \
   	 --conf spark.hadoop.fs.s3a.fast.upload=true \
   	 --conf spark.hadoop.fs.s3a.connection.establish.timeout=500 \
   	 --conf spark.hadoop.fs.s3a.connection.timeout=5000 \
   	 --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
   	 --conf spark.hadoop.com.amazonaws.services.s3.enableV4=true \
   	 --conf spark.hadoop.com.amazonaws.services.s3.enforceV4=true \
   	 --conf spark.yarn.nodemanager.pmem-check-enabled=true \
        --conf spark.yarn.nodemanager.vmem-check-enabled=true \
   	 --conf spark.driver.cores=4 \
   	 --conf spark.executor.cores=3 \
   	 --conf spark.yarn.driver.memoryOverhead=4096 \
        --conf spark.yarn.max.executor.failures=100 \
   	 --conf spark.task.cpus=1 \
   	 --conf spark.rdd.compress=true \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.segment.etl.numexecutors=100 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.sql.hive.convertMetastoreParquet=false \
        --conf spark.task.maxFailures=4 \
        --conf spark.shuffle.minNumPartitionsToHighlyCompress=32 \
        --conf spark.segment.processor.partition.count=1536 \
        --conf spark.segment.processor.output-shard.count=60 \
        --conf spark.segment.processor.binseg.partition.threshold.bytes=500000000000 \
        --conf spark.driver.maxResultSize=2g \
        --conf spark.hadoop.fs.s3.maxRetries=2 \
        --conf spark.kryoserializer.buffer.max=512m \
        --conf spark.kryo.registrationRequired=false \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=1536 \
       --class  <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar  \
        <jar-file-name>.jar`
   
   
   
   6# Below are the benchmarking metrics: 
        BulkInsert MoR (54 GB data) : 1 hr
        Upsert MoR (44 GB data) : 1.6 hr
   
   7# Below are the Hudi configs:
   BulkInsert: 
   `Df.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
           DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString)
         .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP, 100)
         .option(HoodieIndexConfig.SIMPLE_INDEX_INPUT_STORAGE_LEVEL, "DISK_ONLY")
         .option(HoodieWriteConfig.WRITE_STATUS_STORAGE_LEVEL, "DISK_ONLY")
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
          .option(HoodieWriteConfig.UPSERT_PARALLELISM, 2000)
         .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT_PROP, "false")
         .option(HoodieStorageConfig.LOGFILE_SIZE_MAX_BYTES, 256 * 1024 * 1024)
         .option(HoodieStorageConfig.LOGFILE_TO_PARQUET_COMPRESSION_RATIO, 0.35)
         .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 1024)
          .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS, "false")
         .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE, 200 * 1000)
         .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 0)
         .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 50 * 1024 * 1024)
         .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 50 * 1024 * 1024)
         .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "false")
         .mode(SaveMode.Append)
         .save(s"$basePath/$tableName/")`
   
   Upsert:
   `Df.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY,
           DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY,partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString)
         .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP, 100)
         .option(HoodieIndexConfig.SIMPLE_INDEX_INPUT_STORAGE_LEVEL, "DISK_ONLY")
         .option(HoodieWriteConfig.WRITE_STATUS_STORAGE_LEVEL, "DISK_ONLY")
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
          .option(HoodieWriteConfig.BULKINSERT_PARALLELISM, 2000)
          .option(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP, false)
         .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, false)
         .mode(SaveMode.Overwrite)
         .save(s"$basePath/$tableName/")`
   





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-937436336


   Also, can you post your Spark stages UI so that we can see some metrics w.r.t. data skewness and how much parallelism we are hitting?





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-918144599


   Hi @nsivabalan,
   
   I have tried changing the index type to Simple Index as well and below are my upsert and bulk-insert configurations respectively:
   Upsert
   ------
   
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
          .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP,HoodieIndex.IndexType.SIMPLE.toString())
         .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP,200)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
         .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, true)
         .option(HoodieWriteConfig.UPSERT_PARALLELISM, customNumPartitions)
         .option(HoodieWriteConfig.COMBINE_BEFORE_UPSERT_PROP, false)
         .option(HoodieWriteConfig.WRITE_BUFFER_LIMIT_BYTES, 41943040)
         .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
         .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, true)
         .mode(SaveMode.Append)
         .save(s"$basePath/$tableName/")
   
   Bulk-Insert :
   ------------
   userSegDf.write
         .format("hudi")
         .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_TABLE_TYPE_OPT_VAL)
         .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, keyGenClass)
         .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
         .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
         .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
         .option(HoodieWriteConfig.TABLE_NAME, tableName)
         .option(HoodieIndexConfig.INDEX_TYPE_PROP,HoodieIndex.IndexType.SIMPLE.toString())
         .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP,200)
         .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
         .option(DataSourceWriteOptions.ENABLE_ROW_WRITER_OPT_KEY, true)
         .option(HoodieWriteConfig.COMBINE_BEFORE_INSERT_PROP, false)
         .option(HoodieWriteConfig.WRITE_BUFFER_LIMIT_BYTES, 41943040)
         .option(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_RECORD_SIZE_ESTIMATE, 100)
         .option(HoodieWriteConfig.BULKINSERT_SORT_MODE, BulkInsertSortMode.NONE.toString())
          .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, true)
         .mode(SaveMode.Overwrite)
         .save(s"$basePath/$tableName/")
   
   Using the Simple index helped a bit, but now the below stage has been running for more than 2 hrs; it is progressing, but very slowly:
   
   https://github.com/apache/hudi/blob/3e71c915271d77c7306ca0325b212f71ce723fc0/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L154
   
   Let me know in case any more details are required.
   
   








[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-924544144


   Got it. Would you mind sharing screenshots of the Spark stages? That will give us an idea of where most of the time is being spent.





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917361121


   Hi @danny0405, can you explain a bit more about "if the BloomFilter got a false positive"?
   In my case the record key is concat(uuid4, segmentId). SegmentId is an integer value, i.e. it can be the same for multiple records, and uuid4 is a standard unique random value (note: the "-" characters are removed from the uuid4 values though), but the combination of both identifies a record uniquely. The partition key is again segmentId, as it has low cardinality.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-915765776


   Hi @nsivabalan, I was looking for some assistance on this.
   I have followed all the optimizations provided in https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=115510763#content/view/115510763, but even then a Hudi insert of a 53 GB gzip file on a fairly large EMR cluster
   (Cluster Size: 1 Master (m5.xlarge) node, 2 * (r5a.24xlarge) core nodes and 6 * (r5a.24xlarge) task nodes) is taking almost 2 hrs.
   
   I have given all the details above.





[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-917646992


   Hi @nsivabalan ,
   
   Sure, I will try bulk-insert once and update. Also, regarding the "right value for avg record size config": that is specific to Copy on Write (hoodie.copyonwrite.record.size.estimate). Is there no such config for Merge on Read?
   
   1# Upserts can be spread across partitions or be specific to a few, depending on the data received for that day, and a load can also contain just appends.
   2# No, the record key doesn't have any timestamp affinity. As mentioned, the record key is concat(segmentId, uuid4). SegmentId is an integer value, i.e. it can be the same for multiple records, and uuid4 is a standard unique random value (note: the "-" characters are removed from the uuid4 values though), but the combination of both identifies a record uniquely. The partition key is again segmentId, as it has low cardinality.





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-926578764


   Below are the Spark stages:
   BulkInsert (MoR):
   
   ![image](https://user-images.githubusercontent.com/85560823/134672503-66e0ea24-44d5-4103-aa18-4a08c2f25996.png)
   





[GitHub] [hudi] Ambarish-Giri commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-916252147


   Hi @danny0405, as mentioned, my use case is purely batch. Is Flink Hudi for streaming or for batch?
   





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-937477763


   I went over your latest messages.
   I guess you interchanged the upsert and bulk_insert commands while posting above; never mind.
   
   Let me comment on each command.
   1. I see that we have added a lot of custom options w/ spark-submit. When I did benchmarking, 100 GB could get bulk_inserted in 1 to 2 mins with simple record keys and partition paths, so something strange is definitely going on.
   Can we try to remove all the custom options and try a simple command? Does your executor have 48G memory? Just confirming.
   
   I have tried to trim a few configs below, but let's keep things minimal so that once we get a good perf run, we can add back these configs and see which one is causing the spike in perf.
   
   ```
   spark-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 12G --executor-memory 48G \
        --conf spark.yarn.executor.memoryOverhead=8192 \
        --conf spark.executor.extraJavaOptions="-XX:+UseG1GC" \
        --conf spark.shuffle.io.numConnectionsPerPeer=3 \
        --conf spark.shuffle.file.buffer=512k \
        --conf spark.memory.fraction=0.7 \
        --conf spark.memory.storageFraction=0.5 \
        --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
        --conf spark.hadoop.fs.s3a.connection.maximum=2000 \
        --conf spark.hadoop.fs.s3a.fast.upload=true \
        --conf spark.hadoop.fs.s3a.connection.establish.timeout=500 \
        --conf spark.hadoop.fs.s3a.connection.timeout=5000 \
        --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
        --conf spark.hadoop.com.amazonaws.services.s3.enableV4=true \
        --conf spark.hadoop.com.amazonaws.services.s3.enforceV4=true \
        --conf spark.driver.cores=4 \
        --conf spark.executor.cores=3 \
        --conf spark.yarn.driver.memoryOverhead=8192 \
        --conf spark.yarn.max.executor.failures=100 \
        --conf spark.rdd.compress=true \
        --conf spark.yarn.maxAppAttempts=3 \
        --conf spark.network.timeout=800 \
        --conf spark.shuffle.service.enabled=true \
        --conf spark.task.maxFailures=4 \
        --conf spark.driver.maxResultSize=2g \
        --conf spark.hadoop.fs.s3.maxRetries=2 \
        --conf spark.kryoserializer.buffer.max=1024m \
        --conf spark.kryo.registrationRequired=false \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.sql.shuffle.partitions=1536 \
        --class <class-name> \
        --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
        <jar-file-name>.jar
   ```
   For eg: when I did bulk_insert benchmarking, I used the below w/ spark-shell
   ```
   ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.kryoserializer.buffer.max=1024m' --driver-memory 8g --executor-memory 9g   --master yarn --deploy-mode client  --num-executors 15 --executor-cores 8  --conf spark.rdd.compress=true       --conf spark.driver.userClassPathFirst=true     --conf spark.executor.userClassPathFirst=true        --conf spark.ui.proxyBase=""    --conf "spark.memory.storageFraction=0.8"  --conf "spark.driver.extraClassPath=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70"     --conf "spark.executor.extraClassPath=-XX:NewSize=1g -XX:SurvivorRatio=2 -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70" --conf 'spark.executor.memoryOverhead=2000m'
   ```
   Nothing fancy, just set the appropriate memory, cores and some GC tuning configs and things worked for me. 
   
   - bulk_insert configs:
   Let's increase the index parallelism to 1000 and remove the storage level configs. I mean, let's try to get a baseline first, and then iteratively we can add back more configs. I see you are setting the parquet max file size in your upsert command; we probably need to set it here too.
   
   - upsert configs:
   Again, let's set the index parallelism to 1000 and remove the storage level configs (a trimmed sketch follows below).
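   
   A hedged sketch (not from this thread) of what the trimmed upsert options might look like under the suggestions above, with the index parallelism raised to 1000 and the storage-level overrides dropped; the constants and placeholders are the same Hudi 0.7.0 ones used earlier in the thread:
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   import org.apache.hudi.config.{HoodieIndexConfig, HoodieWriteConfig}
   import org.apache.hudi.index.HoodieIndex
   import org.apache.spark.sql.SaveMode
   
   Df.write
     .format("hudi")
     .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
     .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, key)
     .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKey)
     .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, combineKey)
     .option(HoodieWriteConfig.TABLE_NAME, tableName)
     .option(HoodieIndexConfig.INDEX_TYPE_PROP, HoodieIndex.IndexType.SIMPLE.toString)
     .option(HoodieIndexConfig.SIMPLE_INDEX_PARALLELISM_PROP, 1000) // raised from 100 as suggested
     // DISK_ONLY storage-level overrides intentionally left out for the baseline run
     .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
     .mode(SaveMode.Append)
     .save(s"$basePath/$tableName/")
   ```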
   
   
   
   








[GitHub] [hudi] Ambarish-Giri edited a comment on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
Ambarish-Giri edited a comment on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-916252147


   Hi @danny0405, as mentioned, my use case is purely batch. Is Flink Hudi for streaming or for batch?
   Moreover, my core application is on Spark, hence I wanted to go with Spark only.





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-924546245


   If the cardinality of your partition field is low, we can try partitioning on a different field with higher cardinality. We can leverage more parallel processing depending on the number of partitions; within each partition we can't do much parallel processing, so we are limited. I mean, Hudi does assign one file group to each executor, but I am talking about indexing.





[GitHub] [hudi] nsivabalan commented on issue #3605: [SUPPORT]Hudi Inserts and Upserts for MoR and CoW tables are taking very long time.

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #3605:
URL: https://github.com/apache/hudi/issues/3605#issuecomment-924545165


   By the way, an orthogonal point.
   I see your record key is {segmentId,uuid} and the partition path is segmentId. I am not sure you need to prefix segmentId to your record keys if you are solely using them to uniquely identify records and apply updates within Hudi. If there is no external-facing requirement for record keys to be the pair {segmentId,uuid}, you can just use uuid (see the sketch below).
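   
   As a small sketch of that suggestion (the column names "uuid" and "segmentId" are assumptions, not confirmed in the thread), the key setup would reduce to:
   
   ```scala
   import org.apache.hudi.DataSourceWriteOptions
   
   val keyOptions = Map(
     DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY  -> "org.apache.hudi.keygen.SimpleKeyGenerator",
     DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY     -> "uuid",      // no segmentId prefix needed
     DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "segmentId"  // partition path already carries segmentId
   )
   ```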

