Posted to commits@hudi.apache.org by "soumilshah1995 (via GitHub)" <gi...@apache.org> on 2023/03/13 13:53:12 UTC
[GitHub] [hudi] soumilshah1995 opened a new issue, #8166: [SUPPORT] Hudi Bucket Index
soumilshah1995 opened a new issue, #8166:
URL: https://github.com/apache/hudi/issues/8166
**Subject :** Question on Hudi bucket index
* Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions, relatively even data distribution across partitions, and can achieve relatively even data distribution on the bucket hash field column. It can have better upsert performance in these cases due to no index lookup involved as file groups are located based on a [hashing mechanism](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index#RFC29:HashIndex-Howhashindexworks), which is very fast. This is totally different from both simple and Bloom indexes, where an explicit index lookup step is involved during write. The buckets here has one-one mapping with the hudi file group and since the total number of buckets (defined by hoodie.bucket.index.num.buckets(default – 4)) is fixed here, it can potentially lead to skewed data (data distributed unevenly across buckets) and scalability (buckets can grow over time) issues over time. These issues will be addressed in the upcoming [consistent hashing bucket index](https://issues.apache.org/jira/browse/HUDI-3000), which is going to be a special type of bucket index.
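To make the hashing idea in the passage above concrete, here is a minimal sketch of how a fixed number of buckets can map a key to a file group without any index lookup. This is not Hudi's implementation: `bucket_id` is a hypothetical helper and `zlib.crc32` is only a stand-in for whatever hash function Hudi actually uses.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # mirrors hoodie.bucket.index.num.buckets in the passage above


def bucket_id(hash_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a hash-field value to a bucket number (hypothetical helper).

    zlib.crc32 is a stand-in hash; Hudi's real hash function may differ.
    """
    return zlib.crc32(hash_key.encode("utf-8")) % num_buckets


# The same key always lands in the same bucket, so an upsert can find its
# target file group by recomputing the hash instead of consulting an index.
assert bucket_id("uuid-1") == bucket_id("uuid-1")

# With many distinct keys the buckets tend to fill fairly evenly ...
even = Counter(bucket_id(f"uuid-{i}") for i in range(10_000))

# ... but a hash field with only two distinct values can fill at most two buckets.
skewed = Counter(bucket_id(country) for country in ["IN", "US"] * 5_000)

print("even  :", dict(sorted(even.items())))
print("skewed:", dict(sorted(skewed.items())))
```

The second counter illustrates the skew concern from the passage: because the bucket count is fixed, a low-cardinality hash field leaves some buckets empty while others keep growing.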
##### Questions:
* Are the settings below the right way to implement a Bucket Index?
```
,"hoodie.index.type":"BUCKET"
,"hoodie.index.bucket.engine" : 'SIMPLE'
,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
,'hoodie.bucket.index.num.buckets':"4"
```
* Assuming the answer is yes, should I expect to see 4 folders with the base files inside them when selecting this option, or should I simply see a number at the start of the base file names (e.g. 0000, 0001)?
* How do I specify that I want to hash on, say, the column "country" and not on the record key?
* I am attaching some sample code so I can properly understand how to specify the columns if I want to hash on, say, country.
* Could the documentation on the Hudi website about indexes be expanded with more information and some examples for the bucket index and the other index types?
* Is consistent hashing only for MOR tables?
```
try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from pyspark.sql.functions import col, asc, desc
    from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from datetime import datetime
    from functools import reduce
    from faker import Faker
except Exception as e:
    pass
SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
spark = SparkSession.builder \
.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
.config('className', 'org.apache.hudi') \
.config('spark.sql.hive.convertMetastoreParquet', 'false') \
.getOrCreate()
db_name = "hudidb"
table_name = "hudi_bucket_table"
recordkey = 'uuid'
path = f"file:///C:/tmp/{db_name}/{table_name}"
precombine = "date"
method = 'upsert'
table_type = "COPY_ON_WRITE" # COPY_ON_WRITE | MERGE_ON_READ
PARTITION_FIELD = "country"
hudi_options = {
'hoodie.table.name': table_name,
'hoodie.datasource.write.recordkey.field': recordkey,
'hoodie.datasource.write.table.name': table_name,
'hoodie.datasource.write.operation': method,
'hoodie.datasource.write.precombine.field': precombine
,"hoodie.upsert.shuffle.parallelism":100
,"hoodie.index.type":"BUCKET"
,"hoodie.index.bucket.engine" : 'SIMPLE'
,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
,'hoodie.bucket.index.num.buckets':"4"
,"hoodie.clean.automatic": "true"
, "hoodie.clean.async": "true"
, "hoodie.cleaner.policy": 'KEEP_LATEST_FILE_VERSIONS'
, "hoodie.cleaner.fileversions.retained": "3"
, "hoodie.cleaner.parallelism": '200'
, 'hoodie.cleaner.commits.retained': 5
}
spark_df = spark.createDataFrame(
data=[
(1, "insert 1", "2020-01-06 12:12:12", "IN"),
(2, "insert 2", "2020-01-06 12:12:13", "US"),
(3, "insert 3", "2020-01-06 12:12:15", "IN"),
(4, "insert 4", "2020-01-06 12:13:15", "US"),
],
schema=["uuid", "message", "date", "country"])
spark_df.show()
spark_df.write.format("hudi"). \
options(**hudi_options). \
mode("append"). \
save(path)
spark_df = spark.createDataFrame(
data=[
(1, "update 1", "2020-01-06 12:12:12", "IN"),
(2, "update 2", "2020-01-06 12:12:13", "US"),
(5, "insert 5", "2020-01-06 12:13:15", "US"),
],
schema=["uuid", "message", "date", "country"])
spark_df.show()
spark_df.write.format("hudi"). \
options(**hudi_options). \
mode("append"). \
save(path)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492990599
@soumilshah1995 How can I use flink to do this?
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1470007806
Got it, that makes sense @KnightChess
[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492976098
@KnightChess Hi, I used the conf below to test bulk insert, but only one parquet file comes out. Did I miss something? I expected 5 parquet files (5 buckets). My dataset is about 120GB.
```
CREATE TABLE hbase2hudi_sink(
uid STRING PRIMARY KEY NOT ENFORCED,
oridata STRING,
update_time TIMESTAMP_LTZ(3)
) WITH (
'table.type' = 'MERGE_ON_READ',
'connector' = 'hudi',
'path' = '%s',
'write.operation' = 'bulk_insert',
'precombine.field' = 'update_time',
'write.tasks' = '2',
'index.type' = 'BUCKET',
'hoodie.bucket.index.hash.field' = 'uid',
'hoodie.bucket.index.num.buckets' = '5'
)
```
<img width="835" alt="image" src="https://user-images.githubusercontent.com/104059106/229291867-c6c4f9fa-1183-4adb-838b-c72684868b6f.png">
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1493328113
Sorry, I don't have much experience with Flink.
Tagging @danny0405
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469904815
@KnightChess
My expectation was to see buckets (folders), which were not created. Please correct me if I am wrong.
I was expecting to see folders 1, 2, 3 and 4, with my base files inside them.
[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469983031
@soumilshah1995 Yes, I get the same error. The description in the code config is:
<img width="774" alt="image" src="https://user-images.githubusercontent.com/20125927/225317801-427f5c3f-efc4-4ac7-8c83-2b8230600a54.png">
but the docs have two descriptions:
<img width="963" alt="image" src="https://user-images.githubusercontent.com/20125927/225317952-fe77a0f4-fcc5-4e55-96ab-cd88a48328ed.png">
<img width="788" alt="image" src="https://user-images.githubusercontent.com/20125927/225318054-15031733-3e7d-4769-8e5d-6969c60f24b3.png">
It looks like other fields are not supported: the hash field needs to be a subset of the record key. Sorry, my mistake.
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469888165
@KnightChess
I was expecting to see 4 folders (buckets) with the base files inside them when selecting this option. I didn't see any partition folders.
[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469912792
Oh, your hash field is the partition column. I think the partition folders will still be "IN" and "US", each with only one bucket-numbered file inside.
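A small sketch of the layout described here, using the four sample rows from the issue: partitions become folders, while buckets only show up as a prefix on the file names inside each partition. The helper names, the stand-in hash (`zlib.crc32`), the four-digit prefix format, and the `<file-id>` placeholder are assumptions for illustration, not Hudi's actual hashing or file naming.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hoodie.bucket.index.num.buckets from the issue

# (uuid, country) pairs from the sample data in this issue
rows = [(1, "IN"), (2, "US"), (3, "IN"), (4, "US")]


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Stand-in hash on the record key; Hudi's real hash may differ."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def expected_layout(rows, num_buckets: int = NUM_BUCKETS) -> dict:
    """Group bucket prefixes by partition folder (hypothetical helper)."""
    layout = defaultdict(set)
    for record_key, partition in rows:
        prefix = f"{bucket_id(str(record_key), num_buckets):04d}-"
        layout[partition].add(prefix)
    return {partition: sorted(prefixes) for partition, prefixes in layout.items()}


for partition, prefixes in expected_layout(rows).items():
    print(f"{partition}/")
    for prefix in prefixes:
        print(f"    {prefix}<file-id>.parquet")
```

The point of the sketch is only that there is no folder per bucket: each partition folder can hold up to `NUM_BUCKETS` file groups, distinguished by their bucket prefix.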
[GitHub] [hudi] chenbodeng719 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "chenbodeng719 (via GitHub)" <gi...@apache.org>.
chenbodeng719 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1493546851
> sorry i dont have much experience in flink tagging @danny0405
Anyway, thanks for your response.
[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469900502
@soumilshah1995 See the Python tab at https://hudi.apache.org/docs/quick-start-guide#insert-data. You need to set `hoodie.datasource.write.partitionpath.field` as described in the doc.
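As a rough sketch of that suggestion, the options from the issue could be assembled with the partition path field set explicitly. Every option key below appears somewhere in this thread; the `bucket_index_options` helper itself is hypothetical, and the result has not been run against a Hudi table here.

```python
def bucket_index_options(table_name, record_key, partition_field,
                         num_buckets=4, hash_field=None):
    """Build write options for a partitioned table with a simple bucket index.

    Hypothetical helper; the option keys are the ones quoted in this thread.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        # Without this option the sample code produced no partition folders.
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.index.type": "BUCKET",
        "hoodie.index.bucket.engine": "SIMPLE",
        "hoodie.bucket.index.num.buckets": str(num_buckets),
    }
    if hash_field is not None:
        # Optional; per the error reported in this thread it must be a
        # subset of the record key.
        options["hoodie.bucket.index.hash.field"] = hash_field
    return options


hudi_options = bucket_index_options("hudi_bucket_table", "uuid", "country")
for key, value in hudi_options.items():
    print(f"{key} = {value}")
```

The resulting dictionary can be passed to the writer the same way as in the issue, via `options(**hudi_options)`.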
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1470013677
Thanks a lot for your help
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469969427
##### You mentioned I can use 'hoodie.bucket.index.hash.field'
I am getting an error when I specify the hash field.
### Code
```
try:
    import os
    import sys
    import uuid
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark import SparkConf, SparkContext
    from pyspark.sql.functions import col, asc, desc
    from pyspark.sql.functions import col, to_timestamp, monotonically_increasing_id, to_date, when
    from pyspark.sql.functions import *
    from pyspark.sql.types import *
    from datetime import datetime
    from functools import reduce
    from faker import Faker
except Exception as e:
    pass
SUBMIT_ARGS = "--packages org.apache.hudi:hudi-spark3.3-bundle_2.12:0.13.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
spark = SparkSession.builder \
.config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer') \
.config('className', 'org.apache.hudi') \
.config('spark.sql.hive.convertMetastoreParquet', 'false') \
.getOrCreate()
db_name = "hudidb"
table_name = "hudi_bucket_table"
recordkey = 'uuid'
path = f"file:///C:/tmp/{db_name}/{table_name}"
precombine = "date"
method = 'upsert'
table_type = "COPY_ON_WRITE" # COPY_ON_WRITE | MERGE_ON_READ
PARTITION_FIELD = "country"
hudi_options = {
'hoodie.table.name': table_name,
'hoodie.datasource.write.recordkey.field': recordkey,
'hoodie.datasource.write.table.name': table_name,
'hoodie.datasource.write.operation': method,
'hoodie.datasource.write.precombine.field': precombine
,"hoodie.upsert.shuffle.parallelism":100
,"hoodie.index.type":"BUCKET"
,"hoodie.index.bucket.engine" : 'SIMPLE'
,'hoodie.storage.layout.partitioner.class':'org.apache.hudi.table.action.commit.SparkBucketIndexPartitioner'
,'hoodie.bucket.index.num.buckets':"4"
# ,"hoodie.datasource.write.partitionpath.field":PARTITION_FIELD
,"hoodie.bucket.index.hash.field":PARTITION_FIELD
,"hoodie.clean.automatic": "true"
, "hoodie.clean.async": "true"
, "hoodie.cleaner.policy": 'KEEP_LATEST_FILE_VERSIONS'
, "hoodie.cleaner.fileversions.retained": "3"
, "hoodie.cleaner.parallelism": '200'
, 'hoodie.cleaner.commits.retained': 5
}
spark_df = spark.createDataFrame(
data=[
(1, "insert 1", "2020-01-06 12:12:12", "IN"),
(2, "insert 2", "2020-01-06 12:12:13", "US"),
(3, "insert 3", "2020-01-06 12:12:15", "IN"),
(4, "insert 4", "2020-01-06 12:13:15", "US"),
],
schema=["uuid", "message", "date", "country"])
spark_df.show()
spark_df.write.format("hudi"). \
options(**hudi_options). \
mode("append"). \
save(path)
```
#### Error Message
```
Py4JJavaError: An error occurred while calling o106.save.
: org.apache.hudi.exception.HoodieIndexException: Bucket index key (if configured) must be subset of record key.
at org.apache.hudi.config.HoodieIndexConfig$Builder.validateBucketIndexConfig(HoodieIndexConfig.java:692)
at org.apache.hudi.config.HoodieIndexConfig$Builder.build(HoodieIndexConfig.java:660)
at org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:2869)
at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:3004)
at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:2999)
at org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:188)
at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:193)
at org.apache.hudi.HoodieSparkSqlWriter$.$anonfun$write$14(HoodieSparkSqlWriter.scala:337)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:334)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:150)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:109)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:169)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:94)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:81)
at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:116)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:860)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:390)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:363)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
at java.base/java.lang.reflect.Method.invoke(Method.java:578)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:1589)
```
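The rule behind this error message can be sketched as a simple subset check. The function below is a hypothetical illustration of the rule as worded in the exception, not Hudi's `validateBucketIndexConfig` code; treating both settings as comma-separated field lists is an assumption.

```python
def validate_bucket_hash_field(record_key_fields: str, hash_fields: str = "") -> None:
    """Raise ValueError if the hash fields are not a subset of the record key fields.

    Both arguments are comma-separated field lists, e.g. "uuid" or "uuid,country".
    Hypothetical helper mirroring the error message above; not Hudi code.
    """
    record_keys = {f.strip() for f in record_key_fields.split(",") if f.strip()}
    hash_keys = {f.strip() for f in hash_fields.split(",") if f.strip()}
    if not hash_keys <= record_keys:
        raise ValueError(
            f"Bucket index key {sorted(hash_keys)} must be a subset of "
            f"record key {sorted(record_keys)}"
        )


validate_bucket_hash_field("uuid")                     # no hash field configured: passes
validate_bucket_hash_field("uuid,country", "country")  # part of a composite key: passes

try:
    validate_bucket_hash_field("uuid", "country")      # the failing case from this comment
except ValueError as err:
    print(err)
```

Read this way, hashing on `country` would require `country` to be part of the record key (for example a composite key), which is an inference from the message and is not tested in this thread.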
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469947114
So buckets can be used together with partitions, is that correct?
[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469942536
Expect to see files whose names start with 0000-, 0001-, and so on.
[GitHub] [hudi] KnightChess commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "KnightChess (via GitHub)" <gi...@apache.org>.
KnightChess commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469813976
- See https://hudi.apache.org/docs/configurations/#INDEX: you can set `hoodie.bucket.index.hash.field` to specify the hash field
- Consistent hashing is only available for MOR tables for now
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1469919620
@KnightChess
When we use the word "bucket", I was expecting folders 1, 2, 3 and 4, with Hudi hashing on the hash field and inserting records into those folders. That didn't happen.
Bucket indexes are suitable for upsert use cases on huge datasets with a large number of file groups within partitions, relatively even data distribution across partitions, and can achieve relatively even data distribution on the bucket hash field column. It can have better upsert performance in these cases due to no index lookup involved as file groups are located based on a [hashing mechanism](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index#RFC29:HashIndex-Howhashindexworks), which is very fast. This is totally different from both simple and Bloom indexes, where an explicit index lookup step is involved during write. The buckets here has one-one mapping with the hudi file group and since the total number of buckets (defined by hoodie.bucket.index.num.buckets(default – 4)) is fixed here, it can potentially lead to skewed data (data distributed unevenly across buckets) and scalability (buckets can grow over time) issues over time. These issues will be addressed in the upcoming [consistent hashing bucket index](https://issues.apache.org/jira/browse/HUDI-3000), which is going to be a special type of bucket index.
```
The buckets here has one-one mapping with the hudi file group and since the total number of buckets (defined by hoodie.bucket.index.num.buckets(default – 4))
```
Does this mean I should expect 4 folders?
[GitHub] [hudi] soumilshah1995 closed issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 closed issue #8166: [SUPPORT] Hudi Bucket Index
URL: https://github.com/apache/hudi/issues/8166
[GitHub] [hudi] soumilshah1995 commented on issue #8166: [SUPPORT] Hudi Bucket Index
Posted by "soumilshah1995 (via GitHub)" <gi...@apache.org>.
soumilshah1995 commented on issue #8166:
URL: https://github.com/apache/hudi/issues/8166#issuecomment-1492976512
Please refer to the following video:
https://www.youtube.com/watch?v=lOQFUrfJFP4&t=248s
Hope this helps.