
Posted to dev@hudi.apache.org by Roopa Murthy <Ro...@nortonlifelock.com.INVALID> on 2020/10/21 17:45:49 UTC

Bucketing in Hudi

Hello Hudi team,

We have a requirement to compact data on S3, but we need bucketing on top of compaction so that, at query time, only the files relevant to the "id" in the query are scanned. We have been told that bucketing is not currently supported in Hudi. Is it possible to extend Hudi to support it? What would it take to extend the framework to do this?

We are trying to assess, from a timeline perspective, whether this is an option to consider, and we need your help in analyzing and planning for it.

Thanks,
Roopa

Re: [EXT] Re: Bucketing in Hudi

Posted by "Kizhakkel Jose, Felix" <fe...@philips.com.INVALID>.

Hi Balaji,

Does the bucketing implementation in Hudi adhere to Hive-style bucketing (Hive murmur3hash for Hive 3.x.y, and hivehash for Hive 1.x.y and 2.x.y)? That is the bucketing style all downstream processing engines are compatible with.

Regards,
Felix K Jose
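
To illustrate why the choice of hash function matters here: a reader can only skip files if it computes the same bucket for a key as the writer did. The sketch below uses two stand-in hash functions (a Java-style 31-multiplier string hash and CRC32) purely as an assumption for illustration; neither is claimed to be the hash that Hive, Spark, or Hudi actually uses for bucketing.

```python
import zlib


def java_style_hash(key: str) -> int:
    """Stand-in hash #1: 31-multiplier string hash, kept to 32 bits."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def crc32_hash(key: str) -> int:
    """Stand-in hash #2: CRC32 of the UTF-8 bytes of the key."""
    return zlib.crc32(key.encode("utf-8"))


def bucket_of(key: str, num_buckets: int, hash_fn) -> int:
    """Map a key to a bucket number using the given hash function."""
    return hash_fn(key) % num_buckets


def mismatched_keys(keys, num_buckets):
    """Keys that the two hash functions place in different buckets."""
    return [
        k for k in keys
        if bucket_of(k, num_buckets, java_style_hash)
        != bucket_of(k, num_buckets, crc32_hash)
    ]


keys = [f"id-{i}" for i in range(100)]
bad = mismatched_keys(keys, num_buckets=8)
print(f"{len(bad)} of {len(keys)} keys land in different buckets")
```

If the writer used one function and the reader assumed the other, every key in `bad` would be looked up in the wrong file, which is why the writer and every downstream engine need to agree on the hash.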

From: Balaji Varadarajan <v....@ymail.com.INVALID>
Date: Monday, October 26, 2020 at 1:51 PM
To: dev@hudi.apache.org <de...@hudi.apache.org>, Roopa Murthy <Ro...@nortonlifelock.com>
Cc: DL-AIE <dl...@nortonlifelock.com>
Subject: Re: [EXT] Re: Bucketing in Hudi

    On Monday, October 26, 2020, 10:01:44 AM PDT, Roopa Murthy <ro...@nortonlifelock.com> wrote:

Hi Balaji,

Surely that will work.

However, we would like to discuss with you and analyze the efforts as well as estimate the timelines to get all the relevant changes in. We are evaluating other tools as well and our choice would be based on ease of use and amount of changes.

When would be a good time to chat today or tomorrow?

Thanks,

Roopa



Re: [EXT] Re: Bucketing in Hudi

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.

 

    On Monday, October 26, 2020, 10:01:44 AM PDT, Roopa Murthy <ro...@nortonlifelock.com> wrote:  
 
Hi Balaji,

Surely that will work.

However, we would like to discuss with you and analyze the efforts as well as estimate the timelines to get all the relevant changes in. We are evaluating other tools as well and our choice would be based on ease of use and amount of changes.

When would be a good time to chat today or tomorrow?

Thanks,
Roopa
 

Re: [EXT] Re: Bucketing in Hudi

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.

Hi Roopa,

Kindly ping me on the hoodie Slack so we can work out a time within the next couple of days. I would also like to understand your use case better.

Thanks,
Balaji.V



Re: [EXT] Re: Bucketing in Hudi

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.

Hi Roopa,

Bucketing is a more general concept. I think what you are referring to is integration with Spark SQL's bucketing syntax. I was proposing a Hudi-native solution in which we implement bucket indexing. It gives the same end result: compacted (Parquet) files written with keys hashed to obtain a bucket ID. You can then use Hudi's Spark data source integration to write to this table and get a bucketized organization.

Let me know if this makes sense.

Thanks,
Balaji.V
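
As a rough sketch of the idea of hashing keys to get a bucket ID: the record key alone determines which file a record belongs to, so an update always routes to the file that already holds the key. Everything below is an assumption made for illustration (the CRC32 hash, the fixed bucket count, the file naming, and the in-memory `table` dict); it is not Hudi's implementation.

```python
import zlib

NUM_BUCKETS = 4  # assumed to be fixed per table in this sketch


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash the record key to a bucket number (CRC32 is a stand-in hash)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def bucket_file(partition: str, record_key: str) -> str:
    """Hypothetical file naming: one file per (partition, bucket)."""
    return f"{partition}/bucket-{bucket_id(record_key):05d}.parquet"


def write(table: dict, partition: str, record_key: str, value) -> None:
    """Upsert a record: the key alone decides the target file, no lookup needed."""
    table.setdefault(bucket_file(partition, record_key), {})[record_key] = value


table = {}  # file name -> {record key -> value}; stands in for files on S3
for i in range(20):
    write(table, "2020/10/22", f"id-{i}", {"v": i})

# An update to an existing key lands in the same file as the original insert.
write(table, "2020/10/22", "id-7", {"v": 700})

for name in sorted(table):
    print(name, len(table[name]), "records")
```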

Re: [EXT] Re: Bucketing in Hudi

Posted by Roopa Murthy <Ro...@nortonlifelock.com.INVALID>.

Hi Balaji,


Thanks for your response. I went through HoodieIndex in the source code, but I am not sure how indexing alone could help with bucketing.

Spark bucketing involves writing the compacted files in a bucketed/clustered fashion, so that when a Spark SQL query filters on a certain id, only the bucket (file) that the id hashes to is scanned for matching records. This means that data written during compaction has to go through Spark's saveAsTable API, with bucketBy set to the desired number of buckets. Refer to: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-bucketing.html. This creates a Spark bucketed table whose metadata differs from that of Hive bucketed tables, since Spark's bucketing does not use Hive's hashing algorithm.

Is this something that Hudi might support?

Thanks,
Roopa
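
The read-side behaviour described above can be sketched in plain Python: rows are distributed into a fixed number of bucket files by hashing the id, and a lookup by id then opens only the one bucket the id hashes to. The CRC32 hash, the row layout, and the function names are assumptions for illustration only; this is not what Spark's bucketBy does internally.

```python
import zlib


def bucket_of(record_id: str, num_buckets: int) -> int:
    """Stand-in hash: CRC32 of the id, modulo the bucket count."""
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets):
    """Distribute rows into num_buckets 'files' keyed by bucket number."""
    files = {b: [] for b in range(num_buckets)}
    for row in rows:
        files[bucket_of(row["id"], num_buckets)].append(row)
    return files


def query_by_id(files, num_buckets, wanted_id):
    """Scan only the bucket the id hashes to; return matches and scanned buckets."""
    target = bucket_of(wanted_id, num_buckets)
    scanned = [target]
    matches = [row for row in files[target] if row["id"] == wanted_id]
    return matches, scanned


rows = [{"id": f"user-{i}", "amount": i * 10} for i in range(50)]
files = write_bucketed(rows, num_buckets=8)
matches, scanned = query_by_id(files, 8, "user-42")
print(matches, "scanned", len(scanned), "of", len(files), "files")
```

Without bucketing, the same lookup would have to consider all eight files, because any of them could contain the id.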


Re: Bucketing in Hudi

Posted by Balaji Varadarajan <v....@ymail.com.INVALID>.

Hudi supports pluggable indexing (HoodieIndex), and the phases of index lookup are nicely abstracted out. We have a Jira for supporting bucket indexing: https://issues.apache.org/jira/browse/HUDI-55

You can get bucket indexing done by implementing that interface, along with additional changes for handling initial writes to the partition and for bucketing information, which IMO are not significant. If you are interested in contributing, we would be happy to guide you and help land the change.

Thanks,
Balaji.V
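
A minimal sketch of what a pluggable index with a bucket-based implementation might look like, including the initial-write case where a partition has no file for a bucket yet. The class names, the `tag_location` method as written here, the CRC32 hash, and the file group naming are all hypothetical and written in Python for brevity; this is not the HoodieIndex interface or any code from HUDI-55.

```python
import zlib
from abc import ABC, abstractmethod


class RecordIndex(ABC):
    """Hypothetical pluggable index: maps a record to the file group holding it."""

    @abstractmethod
    def tag_location(self, partition: str, record_key: str) -> str:
        """Return the id of the file group this record should be written to."""


class BucketIndex(RecordIndex):
    """Bucket-based index: the location is computed from the key, not looked up."""

    def __init__(self, num_buckets: int):
        self.num_buckets = num_buckets
        self.file_groups = {}  # (partition, bucket) -> file group id

    def tag_location(self, partition: str, record_key: str) -> str:
        bucket = zlib.crc32(record_key.encode("utf-8")) % self.num_buckets
        slot = (partition, bucket)
        if slot not in self.file_groups:
            # Initial write to this partition/bucket: create the file group.
            self.file_groups[slot] = f"{partition}/fg-{bucket:04d}"
        return self.file_groups[slot]


index = BucketIndex(num_buckets=4)
first = index.tag_location("2020/10/21", "id-1")   # initial write creates the group
again = index.tag_location("2020/10/21", "id-1")   # later upsert reuses it
print(first, first == again)
```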


