Posted to commits@hudi.apache.org by "soumilshah1995 (via GitHub)" <gi...@apache.org> on 2023/03/13 13:53:12 UTC

[GitHub] [hudi] soumilshah1995 opened a new issue, #8166: [SUPPORT] Hudi Bucket Index

soumilshah1995 opened a new issue, #8166:
URL: https://github.com/apache/hudi/issues/8166

   **Subject:** Question on Hudi bucket index
   
   * Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions, relatively even data distribution across partitions, and relatively even data distribution on the bucket hash field column. They can deliver better upsert performance in these cases because no index lookup is involved: file groups are located via a [hashing mechanism](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index#RFC29:HashIndex-Howhashindexworks), which is very fast. This is quite different from both the simple and Bloom indexes, where an explicit index lookup step is involved during write. Each bucket here has a one-to-one mapping with a Hudi file group, and since the total number of buckets (defined by hoodie.bucket.index.num.buckets, default 4) is fixed, it can potentially lead to data skew (data distributed unevenly across buckets) and scalability issues (buckets can grow unbounded over time). These issues will be addressed in the upcoming [consistent hashing bucket index](https://issues.apache.org/jira/browse/HUDI-3000), which is going to be a special type of bucket index.
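
   For intuition, here is a minimal sketch (not Hudi's actual hash implementation, just an illustration of the idea): each record key is hashed, taken modulo the fixed bucket count, and the resulting bucket number identifies the file group the record is written to.

   ```
   import zlib

   NUM_BUCKETS = 4  # analogous to hoodie.bucket.index.num.buckets (default 4)

   def bucket_for(record_key, num_buckets=NUM_BUCKETS):
       # A stable hash of the key, modulo the bucket count, picks the bucket;
       # each bucket corresponds to exactly one file group.
       return zlib.crc32(record_key.encode("utf-8")) % num_buckets

   for key in ["1", "2", "3", "4", "5"]:
       print(f"record key {key} -> bucket {bucket_for(key)}")
   ```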
   
   
   ##### Questions:
   
   *  Are the settings mentioned below the right way to implement a bucket index?
   ```
     ,"hoodie.index.type":"BUCKET"
       ,"hoodie.index.bucket.engine" : 'SIMPLE'
       ,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
       ,'hoodie.bucket.index.num.buckets':"4"
   ```
   
   *  Assuming the answer is yes, should I expect to see 4 folders inside which the base files are present when selecting this option, or should I simply see a bucket number (e.g. 0000, 0001) at the start of the base file names?

   * How do I specify that I want the hash to be performed on, say, column "country" and not on the record key?
   
   *  I am attaching some sample code so I can understand this properly: if I want to do hashing on, say, country, how can I specify the column?
   
   
   * Is there a way the index documentation on the Hudi website could be elaborated, adding more information about the bucket index and the others, with some examples?
   
   * Is consistent hashing only for MOR tables?
   
   ```
   
   try:
   
       import os
       import sys
       import uuid
   
       import pyspark
       from pyspark.sql import SparkSession
       from pyspark import SparkConf, SparkContext
       from pyspark.sql.functions import col, asc, desc
       from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
       from pyspark.sql.functions import *
       from pyspark.sql.types import *
       from datetime import datetime
       from functools import reduce
       from faker import Faker
   
   
   except Exception as e:
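        # NOTE: silently swallowing import errors here makes missing dependencies hard to diagnose.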
       pass
   
   
   SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 pyspark-shell"
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
   
   
       
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('className', 'org.apache.hudi') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .getOrCreate()
   
   
   db_name = "hudidb"
   table_name = "hudi_bucket_table"
   
   recordkey = 'uuid'
   path = f"file:///C:/tmp/{db_name}/{table_name}"
   precombine = "date"
   method = 'upsert'
   table_type = "COPY_ON_WRITE"  # COPY_ON_WRITE | MERGE_ON_READ
   PARTITION_FIELD = "country"
   
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': recordkey,
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': method,
       'hoodie.datasource.write.precombine.field': precombine
       ,"hoodie.upsert.shuffle.parallelism":100
       
       ,"hoodie.index.type":"BUCKET"
       ,"hoodie.index.bucket.engine" : 'SIMPLE'
       ,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
       ,'hoodie.bucket.index.num.buckets':"4"
       
   
       ,"hoodie.clean.automatic": "true"
       , "hoodie.clean.async": "true"
       , "hoodie.cleaner.policy": 'KEEP_LATEST_FILE_VERSIONS'
       , "hoodie.cleaner.fileversions.retained": "3"
       , "hoodie-conf hoodie.cleaner.parallelism": '200'
       , 'hoodie.cleaner.commits.retained': 5
       
   }
   
   
   spark_df = spark.createDataFrame(
       data=[
       (1, "insert 1", "2020-01-06 12:12:12", "IN"),
       (2, "insert 2", "2020-01-06 12:12:13", "US"),
       (3, "insert 3", "2020-01-06 12:12:15", "IN"),
       (4, "insert 4", "2020-01-06 12:13:15", "US"),
   ], 
       schema=["uuid", "message",  "date", "country"])
   spark_df.show()
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("append"). \
       save(path)
   
   
   spark_df = spark.createDataFrame(
       data=[
       (1, "update 1", "2020-01-06 12:12:12", "IN"),
       (2, "update 2", "2020-01-06 12:12:13", "US"),
       (5, "insert 5", "2020-01-06 12:13:15", "US"),
   ], 
       schema=["uuid", "message",  "date", "country"])
   spark_df.show()
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("append"). \
       save(path)
   ```
   
   
   




[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492990599

   @soumilshah1995 How can I use Flink to do this?




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1470007806

   
   Got it, makes sense @KnightChess




[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492976098

   @KnightChess Hi, I used the conf below to test bulk insert, and only one parquet file comes out. Did I miss something? I expected 5 parquet files (5 buckets). My dataset is about 120 GB.
   ```
   
           CREATE TABLE hbase2hudi_sink(
               uid STRING PRIMARY KEY NOT ENFORCED,
               oridata STRING,
               update_time TIMESTAMP_LTZ(3)
           ) WITH (
               'table.type' = 'MERGE_ON_READ',
               'connector' = 'hudi',
               'path' = '%s',
               'write.operation' = 'bulk_insert',
               'precombine.field' = 'update_time',
               'write.tasks' = '2',
               'index.type' = 'BUCKET',
               'hoodie.bucket.index.hash.field' = 'uid',
               'hoodie.bucket.index.num.buckets' = '5'
           )
   
   ```
   <img width="835" alt="image" src="https://user-images.githubusercontent.com/104059106/229291867-c6c4f9fa-1183-4adb-838b-c72684868b6f.png">
   




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1493328113

   Sorry, I don't have much experience in Flink.
   Tagging @danny0405




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469904815

   @KnightChess
   My expectation was to see buckets (folders), which were not created; please correct me if I am wrong.
   I was expecting to see folders 1, 2, 3 and 4, and inside them my base files.




[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469983031

   @soumilshah1995 yes, I got the same error. The description in the code config is
   <img width="774" alt="image" src="https://user-images.githubusercontent.com/20125927/225317801-427f5c3f-efc4-4ac7-8c83-2b8230600a54.png">
   but in the docs there are two:
   <img width="963" alt="image" src="https://user-images.githubusercontent.com/20125927/225317952-fe77a0f4-fcc5-4e55-96ab-cd88a48328ed.png">
   <img width="788" alt="image" src="https://user-images.githubusercontent.com/20125927/225318054-15031733-3e7d-4769-8e5d-6969c60f24b3.png">
   It looks like other fields are not supported; the hash field needs to be a subset of the record key. Sorry, my mistake.
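
   One possible way to satisfy that constraint, sketched here only as an illustration and not verified against this exact setup, is to make `country` part of a composite record key so the hash field is a subset of it:

   ```
   # Sketch only: the bucket hash field must be a subset of the record key fields.
   # Note this changes record uniqueness to (uuid, country) and may also require
   # 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'.
   'hoodie.datasource.write.recordkey.field': 'uuid,country',
   'hoodie.bucket.index.hash.field': 'country',
   ```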
   




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469888165

   @KnightChess 
   I was expecting to see 4 folders (buckets) inside which the base files would be present when selecting this option; I didn't see any partition folders.
   
   




[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469912792

   Oh, your hash field is the partition column. I think the partition folders will still be "IN" and "US", each with only one bucket-numbered file inside.
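
   A quick way to check the resulting layout is to list the base files under the table path (a small sketch, using the local path from the example above):

   ```
   import os

   table_path = "C:/tmp/hudidb/hudi_bucket_table"  # path from the example above
   for root, _, files in os.walk(table_path):
       for name in files:
           if name.endswith(".parquet"):
               # With a bucket index, base file names should start with the
               # zero-padded bucket number inside each partition folder (IN/, US/).
               print(os.path.relpath(os.path.join(root, name), table_path))
   ```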




[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1493546851

   > Sorry, I don't have much experience in Flink. Tagging @danny0405

   Anyway, thanks for your response.




[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469900502

   @soumilshah1995 See https://hudi.apache.org/docs/quick-start-guide#insert-data (Python tab); you need to set `hoodie.datasource.write.partitionpath.field` according to the doc.
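
   For the example above, that would mean adding something like the following to `hudi_options` (a sketch):

   ```
   # Write records under country-based partition folders (e.g. IN/, US/).
   'hoodie.datasource.write.partitionpath.field': 'country',
   ```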




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1470013677

   Thanks a lot for your help 




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469969427

   ##### You mentioned I can use 'hoodie.bucket.index.hash.field'
   I am getting an error when I specify the hash field.
   
   
   ### Code
   ```
   try:
   
       import os
       import sys
       import uuid
   
       import pyspark
       from pyspark.sql import SparkSession
       from pyspark import SparkConf, SparkContext
       from pyspark.sql.functions import col, asc, desc
       from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
       from pyspark.sql.functions import *
       from pyspark.sql.types import *
       from datetime import datetime
       from functools import reduce
       from faker import Faker
   
   
   except Exception as e:
       pass
   
   
   SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 pyspark-shell"
   os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
   os.environ['PYSPARK_PYTHON'] = sys.executable
   os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
   
   spark = SparkSession.builder \
       .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
       .config('className', 'org.apache.hudi') \
       .config('spark.sql.hive.convertMetastoreParquet', 'false') \
       .getOrCreate()
   
   
   
   db_name = "hudidb"
   table_name = "hudi_bucket_table"
   
   recordkey = 'uuid'
   path = f"file:///C:/tmp/{db_name}/{table_name}"
   precombine = "date"
   method = 'upsert'
   table_type = "COPY_ON_WRITE"  # COPY_ON_WRITE | MERGE_ON_READ
   PARTITION_FIELD = "country"
   
   hudi_options = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.recordkey.field': recordkey,
       'hoodie.datasource.write.table.name': table_name,
       'hoodie.datasource.write.operation': method,
       'hoodie.datasource.write.precombine.field': precombine
       ,"hoodie.upsert.shuffle.parallelism":100
       
       ,"hoodie.index.type":"BUCKET"
       ,"hoodie.index.bucket.engine" : 'SIMPLE'
       ,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
       ,'hoodie.bucket.index.num.buckets':"4"
   #     ,"hoodie.datasource.write.partitionpath.field":PARTITION_FIELD
       ,"hoodie.bucket.index.hash.field":PARTITION_FIELD
       
       ,"hoodie.clean.automatic": "true"
       , "hoodie.clean.async": "true"
       , "hoodie.cleaner.policy": 'KEEP_LATEST_FILE_VERSIONS'
       , "hoodie.cleaner.fileversions.retained": "3"
       , "hoodie-conf hoodie.cleaner.parallelism": '200'
       , 'hoodie.cleaner.commits.retained': 5
       
   }
   spark_df = spark.createDataFrame(
       data=[
       (1, "insert 1", "2020-01-06 12:12:12", "IN"),
       (2, "insert 2", "2020-01-06 12:12:13", "US"),
       (3, "insert 3", "2020-01-06 12:12:15", "IN"),
       (4, "insert 4", "2020-01-06 12:13:15", "US"),
   ], 
       schema=["uuid", "message",  "date", "country"])
   spark_df.show()
   
   spark_df.write.format("hudi"). \
       options(**hudi_options). \
       mode("append"). \
       save(path)
   ```
   #### Error Message 
   ```
   
   
   Py4JJavaError: An error occurred while calling o106.save.
   : org.apache.hudi.exception.HoodieIndexException: Bucket index key (if configured) must be subset of record key.
   	at org.apache.hudi.config.HoodieIndexConfig$Builder.validateBucketIndexConfig(HoodieIndexConfig.java:692)
   	at org.apache.hudi.config.HoodieIndexConfig$Builder.build(HoodieIndexConfig.java:660)
   	at org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:2869)
   	at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3004)
   	at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:2999)
   	at org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:188)
   	at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:193)
   	at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$14(HoodieSparkSqlWriter.scala:337)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:334)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
   	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
   	at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
   	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
   	at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
   	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
   	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
   	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
   	at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
   	at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
   	at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
   	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
   	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
   	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
   	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
   	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
   	at java.base/java.lang.reflect.Method.invoke(Method.java:578)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   	at java.base/java.lang.Thread.run(Thread.java:1589)
   
   
   ```
   




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469947114

   You can use buckets with partitions, is that correct?
    




[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469942536

   You should expect to see base files prefixed 0000-, 0001-.




[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469813976

   - Per https://hudi.apache.org/docs/configurations/#INDEX, you can set `hoodie.bucket.index.hash.field` to specify the hash field
   - Consistent hashing is only for MOR tables for now
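
   For reference, a rough sketch of a consistent hashing setup (configuration names as documented for 0.13, not tested here; it requires a MERGE_ON_READ table):

   ```
   # Sketch only: consistent hashing bucket index, MOR tables only for now.
   'hoodie.datasource.write.table.type': 'MERGE_ON_READ',
   'hoodie.index.type': 'BUCKET',
   'hoodie.index.bucket.engine': 'CONSISTENT_HASHING',
   'hoodie.bucket.index.num.buckets': '4',  # initial number of buckets
   ```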




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469919620

   @KnightChess 
   When we use the word bucket, I believed I would see folders 1, 2, 3 and 4, and that inside those folders Hudi would perform the hash based on the hash field and insert the records; that didn't happen.
   
   
   Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions, relatively even data distribution across partitions, and relatively even data distribution on the bucket hash field column. They can deliver better upsert performance in these cases because no index lookup is involved: file groups are located via a [hashing mechanism](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index#RFC29:HashIndex-Howhashindexworks), which is very fast. This is quite different from both the simple and Bloom indexes, where an explicit index lookup step is involved during write. Each bucket here has a one-to-one mapping with a Hudi file group, and since the total number of buckets (defined by hoodie.bucket.index.num.buckets, default 4) is fixed, it can potentially lead to data skew (data distributed unevenly across buckets) and scalability issues (buckets can grow unbounded over time). These issues will be addressed in the upcoming [consistent hashing bucket index](https://issues.apache.org/jira/browse/HUDI-3000), which is going to be a special type of bucket index.
   
   ```
    Each bucket here has a one-to-one mapping with a Hudi file group, and since the total number of buckets (defined by hoodie.bucket.index.num.buckets, default 4) is fixed...
   ```
    Does this mean I should expect 4 folders?
   




[GitHub] [hudi] soumilshah1995 closed issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 closed issue #8166: [SUPPORT] Hudi Bucket Index 
URL: https://github.com/apache/hudi/issues/8166




[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index

Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492976512

   Please refer to the following video:
   https://www.youtube.com/watch?v=lOQFUrfJFP4&t=248s
   Hope this helps.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org