Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/01 11:05:36 UTC

[GitHub] [hudi] Raghvendradubey opened a new issue #1694: Slow Write into Hudi Dataset(MOR)

Raghvendradubey opened a new issue #1694:
URL: https://github.com/apache/hudi/issues/1694


   Hi Team,
   
   I am reading data from Kafka and ingesting it into a Hudi dataset (MOR) using the Hudi DataSource API through Spark Structured Streaming.
   The pipeline is structured as follows:
   
   Kafka (source) > Spark Structured Streaming (EMR) > MOR Hudi table (S3)
   
   Spark - 2.4.5
   Hudi - 0.5.2
   
   I am running into performance issues while writing data into the Hudi dataset.
   The following Hudi jobs are taking the most time:
   countByKey at HoodieBloomIndex.java
   countByKey at WorkloadProfile.java
   count at HoodieSparkSqlWriter.scala
   
   The configuration used to write the Hudi dataset is as follows:
   new_df.write.format("org.apache.hudi").option("hoodie.table.name", tableName) \
       .option("hoodie.datasource.write.operation", "upsert") \
       .option("hoodie.datasource.write.table.type", "MERGE_ON_READ") \
       .option("hoodie.datasource.write.recordkey.field", "wbn") \
       .option("hoodie.datasource.write.partitionpath.field", "ad") \
       .option("hoodie.datasource.write.precombine.field", "action_date") \
       .option("hoodie.compact.inline", "true") \
       .option("hoodie.compact.inline.max.delta.commits", "300") \
       .option("hoodie.datasource.hive_sync.enable", "true") \
       .option("hoodie.upsert.shuffle.parallelism", "5") \
       .option("hoodie.insert.shuffle.parallelism", "5") \
       .option("hoodie.bulkinsert.shuffle.parallelism", "5") \
       .option("hoodie.datasource.hive_sync.table", tableName) \
       .option("hoodie.datasource.hive_sync.partition_fields", "ad") \
       .option("hoodie.index.type","GLOBAL_BLOOM") \
       .option("hoodie.bloom.index.update.partition.path", "true") \
       .option("hoodie.datasource.hive_sync.assume_date_partitioning", "false") \
       .option("hoodie.datasource.hive_sync.partition_extractor_class",
               "org.apache.hudi.hive.MultiPartKeysValueExtractor") \
       .mode("append").save(tablePath)
   
   Spark submit command:
   spark-submit --deploy-mode client --master yarn \
       --executor-memory 6g --executor-cores 1 \
       --driver-memory 4g \
       --conf spark.driver.maxResultSize=2g \
       --conf spark.executor.id=driver \
       --conf spark.executor.instances=300 \
       --conf spark.kryoserializer.buffer.max=512m \
       --conf spark.shuffle.service.enabled=true \
       --conf spark.sql.hive.convertMetastoreParquet=false \
       --conf spark.task.cpus=1 \
       --conf spark.yarn.driver.memoryOverhead=1024 \
       --conf spark.yarn.executor.memoryOverhead=3072 \
       --conf spark.yarn.max.executor.failures=100 \
       --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
       --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4 \
       --py-files s3://spark-test/hudi_job.py
   Attaching screenshots of the job details.
   ![hudi-job](https://user-images.githubusercontent.com/16387812/83402829-1cabfc80-a425-11ea-868b-e7c66204ac1e.png)
   
   countByKey at HoodieBloomIndex.java
   ![countbykey-hoodiebloomindx](https://user-images.githubusercontent.com/16387812/83402914-4cf39b00-a425-11ea-8bf3-6a21643d1480.png)
   ![countbykeyhoodiebloomindextask](https://user-images.githubusercontent.com/16387812/83402931-554bd600-a425-11ea-8047-183be072346c.png)
   
   countByKey at WorkloadProfile.java
   ![workloadprofile](https://user-images.githubusercontent.com/16387812/83402997-71e80e00-a425-11ea-9d2d-52e8765b20fc.png)
   ![workloadprofiletask](https://user-images.githubusercontent.com/16387812/83403022-7f9d9380-a425-11ea-9175-f0e43763f4a9.png)
   
   count at HoodieSparkSqlWriter.scala
   ![hoodiesparksqlwriter](https://user-images.githubusercontent.com/16387812/83403066-9348fa00-a425-11ea-9643-26cda91e854f.png)
   ![sparksqlwritertask](https://user-images.githubusercontent.com/16387812/83403078-993edb00-a425-11ea-9a20-1268dd92f813.png)
   
   Please suggest how I can tune this.
   
   Thanks
   Raghvendra


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-655205689


   Happy to work more hands-on and get this working for you. Let me know.





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-638519702


   Beyond the initial shuffle, Hudi will auto-tune everything, so I am not surprised.
   
   On countByKey at HoodieBloomIndex, what's the line number?
   
   count at HoodieSparkSqlWriter is the actual writing of data. We send 100K records to the same insert partition to write larger file sizes. Can you check whether there is skew in that stage? It's tunable.
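   For example, a quick way to eyeball skew in the incoming batch is a sketch like this (illustrative only; `new_df` and the `ad` field are taken from your write config):
   
   ```
   # Count records per partition value; a few hot "ad" values would
   # explain one oversized write task in the Spark UI.
   from pyspark.sql import functions as F
   
   new_df.groupBy("ad").count().orderBy(F.desc("count")).show(10)
   ```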





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-647138192


   > 2 - Yes Indexing is dominating, not sure why exactly it is, but it is after setting parameter hoodie.parquet.small.file.limit = 0
   
   If you turn off small file handling, you end up writing more files, which means indexing has to compare ranges/bloom filters across many more files. For the same reason, you should reconsider this on the query side as well: small files will hurt query performance a lot.
   
   Let's do a reset here and try to design for your use case? Happy to work through this if you can share more about your goals here.
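   For example, restoring small file handling is just a matter of not overriding the limit, or setting it back explicitly. A sketch (the values are the usual defaults, but verify against your version):
   
   ```
   # Re-enable small file handling: inserts get packed into existing
   # files below the limit instead of creating ever more small files.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.parquet.small.file.limit", str(100 * 1024 * 1024)) \
       .option("hoodie.parquet.max.file.size", str(120 * 1024 * 1024)) \
       .mode("append").save(tablePath)
   ```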
   





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-639198638


   Does that one partition seem to be inserts? Hudi reduces insert parallelism to ensure larger file sizes; see http://hudi.apache.org/docs/configurations.html#insertSplitSize. Maybe try lowering this?
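   A sketch of what lowering it could look like (the 100K value is illustrative; the default has been 500000):
   
   ```
   # Spread inserts across more write tasks by lowering the number of
   # records packed into each insert split.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.copyonwrite.insert.split.size", "100000") \
       .mode("append").save(tablePath)
   ```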





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-637256106


   Is there a reason why you are setting the shuffle parallelism to 5, when it seems like you have many more executors?
   
   We can go step by step. Happy to work with you through the tuning process. Can you please summarize your workload: records per partition, upsert vs. insert ratio, ordered vs. random keys.
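   For instance, with 300 single-core executors, something in the neighborhood of the executor count is a more natural starting point (values are illustrative, not a recommendation for your exact workload):
   
   ```
   # Match Hudi's shuffle parallelism to the cluster size instead of 5.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.upsert.shuffle.parallelism", "300") \
       .option("hoodie.insert.shuffle.parallelism", "300") \
       .option("hoodie.bulkinsert.shuffle.parallelism", "300") \
       .mode("append").save(tablePath)
   ```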
   
   Below are some useful resources:
   
   https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide
   https://cwiki.apache.org/confluence/display/HUDI/FAQ
   https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoImodelthedatastoredinHudi





[GitHub] [hudi] Raghvendradubey edited a comment on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey edited a comment on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-637414299


   Hello Vinoth,
   
   I was experimenting with different combinations of shuffle parallelism. I am able to reduce countByKey at WorkloadProfile.java by setting the shuffle parallelism up to 20 or so, but there is no impact on countByKey at HoodieBloomIndex.java or count at HoodieSparkSqlWriter.scala.
   Data stats are as follows:
   1 - more than 500 keys/record
   2 - 7k to 10k records/partition
   3 - upsert vs. insert ratio around 70:30, but this can vary; it's not fixed
   4 - Keys are not ordered per partition; I have ordered the keys while inserting into the Hudi dataset through Spark Structured Streaming.
   
   
   
    
   














[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-639199583


   https://github.com/apache/hudi/blob/release-0.5.2/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
   
   https://github.com/apache/hudi/blob/41202da7788193da77f1ae4b784127bb93eaae2c/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L977
   
   All it is doing is reading the input and shuffling for de-duplication. So not sure.





[GitHub] [hudi] vinothchandar edited a comment on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar edited a comment on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-657091983


   If you want partition movement, then a global index is the only option.
   
   > no certain ordering, I can order it only by timestamp.
   
   GLOBAL_BLOOM (or even the BLOOM index) will work best if the files are sorted by key, so it can skip entire file ranges from being compared and then further prune using bloom filters.
   
   > Any suggestion If I can achieve this without global bloom?
   
   We are working on record-level indexes that should make this much faster in the mid term, but that's not an immediate option. The `master` branch has a `GLOBAL_SIMPLE` index which can be faster than `GLOBAL_BLOOM` in cases where no range-based pruning can occur. Give that a shot?
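   Switching is a one-line config change. A sketch against the write from the issue (assumes a build of `master`, since `GLOBAL_SIMPLE` is not in 0.5.2):
   
   ```
   # Simple global index: joins incoming keys against existing record
   # keys directly, with no bloom-filter or range pruning step.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.index.type", "GLOBAL_SIMPLE") \
       .mode("append").save(tablePath)
   ```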
   
   Another optimization in the `master` branch: dynamic bloom filters that auto-tune themselves for a specific false positive rate.





[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-658571381


   Thanks @vinothchandar for the clarifications, will try GLOBAL_SIMPLE.





[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-655305119


   
   
   
   
   
      
                                                                                                                                                                                                              
   Thanks @vinothchandar, comments are inline.
   
   > is there a way to not use global index?
   
   I need partition movement of records, meaning I need the latest updated record in the latest partition, and I can probably achieve this only by global bloom.
   
   > can you always determine ad for each record?
   
   Yes I can, but how would that help me if I am not using a global index? What about records which are in older partitions?
   
   > is there certain ordering to wbn?
   
   No certain ordering; I can order it only by timestamp.
   
   Yes, I tried without global bloom and upsert is very fast, but the business requirement is that I need the latest updated record in the latest partition. Any suggestion on how I can achieve this without global bloom?





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-644113731


   Sorry, slipped off my radar.
   
   > After setting hoodie.parquet.small.file.limit=0
   
   This suggests that a lot of records were getting packed into existing files before. I think you'd want to keep the old behavior for real production use cases: we absorb that cost on the writer side, so queries will have well-sized files to read.
   
   > countByKey at WorkloadProfile.java started getting more time.
   
   In the screenshot, all I see is indexing dominating the cost..?
   





[GitHub] [hudi] Raghvendradubey edited a comment on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey edited a comment on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-649925956


   @vinothchandar I ran the job with a 5 min batch interval using MOR. Now I can see commit durations are 5 min and compaction is also 5 min, and updated records are only 10% of total records written, but the job is now running with a huge lag.
   Sample commits are below:
   
   ```
   ╔════════════════╤═════════════════════╤═══════════════════╤═════════════════════╤══════════════════════════╤═══════════════════════╤══════════════════════════════╤══════════════╗
   ║ CommitTime     │ Total Bytes Written │ Total Files Added │ Total Files Updated │ Total Partitions Written │ Total Records Written │ Total Update Records Written │ Total Errors ║
   ╠════════════════╪═════════════════════╪═══════════════════╪═════════════════════╪══════════════════════════╪═══════════════════════╪══════════════════════════════╪══════════════╣
   ║ 20200625112117 │ 178.0 MB            │ 1                 │ 3                   │ 2                        │ 193777                │ 18939                        │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200625111810 │ 104.0 MB            │ 0                 │ 1                   │ 1                        │ 149946                │ 12619                        │ 0            ║
   ╟────────────────┼─────────────────────┼───────────────────┼─────────────────────┼──────────────────────────┼───────────────────────┼──────────────────────────────┼──────────────╢
   ║ 20200625111610 │ 211.7 MB            │ 0                 │ 3                   │ 2                        │ 259500                │ 14721                        │ 0            ║
   ╚════════════════╧═════════════════════╧═══════════════════╧═════════════════════╧══════════════════════════╧═══════════════════════╧══════════════════════════════╧══════════════╝
   ```





[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-653368873


   @vinothchandar Excellent, how can I try this async compaction?
   I am attaching the most expensive stages. I am not sure whether I need to scale the cluster or whether I can bring this down with some config changes.
   Please suggest.
   ![HudiSparkSQLWriterNew](https://user-images.githubusercontent.com/16387812/86436998-2759fa00-bd21-11ea-8c82-5d57734aa6ef.png)
   ![WorkloadProfileStageNew](https://user-images.githubusercontent.com/16387812/86437004-2aed8100-bd21-11ea-9452-ec6b0d9d1c05.png)
   
   





[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-638773847


   In the countByKey at HoodieBloomIndex job, the stage mapToPair at HoodieWriteClient.java:977 is taking longer (more than a minute), while the countByKey at HoodieBloomIndex stage itself executes within seconds.
   Yes, there is skew in count at HoodieSparkSqlWriter: most partitions are getting 200 to 500 KB of data, while one partition is getting 100 MB+ of data.
   
   





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-647799639


   @Raghvendradubey thanks for the info. You may also want to understand how much of the existing data changes every minute; if it's 70% updates, I would suggest using MOR, as it can absorb updates more quickly.
   
   Let's target a lenient 5 min Spark streaming batch interval, see how the commit durations look, and go from there?
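   Concretely, widening the micro-batch is a trigger setting on the streaming query. A sketch (illustrative; `hudi_stream_df` and the checkpoint path are hypothetical placeholders, and the Hudi write options from the issue still apply):
   
   ```
   # Wider micro-batches let each Hudi commit amortize indexing and
   # file-sizing work over more data.
   query = (hudi_stream_df.writeStream
       .format("org.apache.hudi")
       .option("checkpointLocation", "s3://spark-test/checkpoints/hudi_job")  # hypothetical
       .trigger(processingTime="5 minutes")
       .outputMode("append")
       .start(tablePath))
   ```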
   
   





[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-646596361


   Hey Vinoth,
   
   1 - Could you please shed some light on the statement "old behavior for real production use-cases"?
   2 - Yes, indexing is dominating. I am not sure exactly why, but it started after setting the parameter hoodie.parquet.small.file.limit = 0.
   





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-653259982


   @Raghvendradubey we are getting async compaction enabled for the Spark Structured Streaming sink; that should bring this down by half, since compaction won't block writing.
   
   Can you paste the latest Spark UI view that shows the most expensive stage? We can work on tuning this down.
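   Once that lands, enabling it should be a sink-side option. A sketch, under the assumption that the option keeps the `hoodie.datasource.compaction.async.enable` name (unreleased at the time of writing, so treat it as tentative; `hudi_stream_df` is a hypothetical placeholder):
   
   ```
   # Run compaction asynchronously so micro-batch commits are not
   # blocked waiting for inline compaction to finish.
   query = (hudi_stream_df.writeStream
       .format("org.apache.hudi")
       .option("hoodie.datasource.compaction.async.enable", "true")  # assumed key
       .option("checkpointLocation", "s3://spark-test/checkpoints/hudi_job")  # hypothetical
       .start(tablePath))
   ```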





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-710520474


   So to clarify: GLOBAL_SIMPLE helps when the workload is random writes affecting, e.g., every file in each write. But it is indeed slow in the sense that it will join against the entire dataset. If you want a better option, you can try the HBase index. Of course there is extra operational overhead, but it will fix the performance and also meet the global indexing requirement.
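   A sketch of the switch (the ZooKeeper quorum and index table names below are hypothetical placeholders for your HBase deployment):
   
   ```
   # HBase index: a true global index whose lookup cost scales with the
   # input batch rather than the table size.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.index.type", "HBASE") \
       .option("hoodie.index.hbase.zkquorum", "zk1,zk2,zk3") \
       .option("hoodie.index.hbase.zkport", "2181") \
       .option("hoodie.index.hbase.table", "hudi_record_index") \
       .mode("append").save(tablePath)
   ```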





[GitHub] [hudi] vinothchandar commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-655205591


   #1752 is the PR. 
   
   What I am seeing is that the range-based pruning is not very effective and is resulting in lots of shuffled data.
   
   Is there a way to not use a global index? I.e., can you always determine `ad` for each record? Given `.option("hoodie.datasource.write.recordkey.field", "wbn")`, is there a certain ordering to `wbn` that we can exploit? I am referring to some material put together here: https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-Whatperformance/ingestlatencycanIexpectforHudiwriting
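   For reference, dropping the global requirement would be a one-line change. A sketch (illustrative only; it gives up partition movement):
   
   ```
   # Non-global bloom index: key lookups are scoped to each record's own
   # partition, so indexing shuffles far less data.
   new_df.write.format("org.apache.hudi") \
       .option("hoodie.index.type", "BLOOM") \
       .option("hoodie.bloom.index.update.partition.path", "false") \
       .mode("append").save(tablePath)
   ```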
   
   In general, we need to make the upsert process dependent not on the size of the table, but rather on the size of the input.
   
   If you are open to trying it, you can switch to the simple index on master, which will be a lot lighter in this particular scenario, where there do not seem to be any benefits from range/bloom information.
   








[GitHub] [hudi] Raghvendradubey commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
Raghvendradubey commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-647325340


   Sure.
   I am trying to achieve near-real-time data (like a Read Optimized View) by updating records over S3.
   E.g., let's say I have the records:
   a1, b1, t1
   a1, b2, t2
   a1, b3, t3
   where t1, t2, t3 are incremental timestamps, so finally I want the a1, b3, t3 record.
   
   Data pipeline:
   Reading data from Kafka through Spark Structured Streaming and performing upserts into the Hudi table over S3.
   
   Data read from Kafka:
   Size - 100-300 MB/minute
   Kafka parallel partitions - 15
   Upsert:Insert - 7:3
   No. of columns - 550
   
   Please let me know if you need more info.
   





[GitHub] [hudi] rafaelhbarros commented on issue #1694: Slow Write into Hudi Dataset(MOR)

Posted by GitBox <gi...@apache.org>.
rafaelhbarros commented on issue #1694:
URL: https://github.com/apache/hudi/issues/1694#issuecomment-692911383


   @Raghvendradubey did the `GLOBAL_SIMPLE` index solve your issue?

