Posted to dev@carbondata.apache.org by Jacky Li <ja...@qq.com> on 2018/02/09 07:44:45 UTC

About bucket feature in carbon

Hi,

One year ago, CarbonData 1.0.0 introduced the bucket table feature. It was expected to improve join performance by avoiding shuffling when both tables are bucketed on the same column with the same number of buckets.
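
To illustrate why matching buckets can avoid a shuffle, here is a minimal sketch in plain Python. It is not CarbonData or Spark code: the function names (bucket_of, bucketize, bucketed_join), the modulo hash, and the sample rows are all assumptions made for illustration. The point is only that when both sides use the same hash function and the same bucket count, rows with equal keys always land in the same bucket number, so each bucket pair can be joined independently.

```python
from collections import defaultdict

def bucket_of(key, num_buckets):
    # Hypothetical hash: integer key modulo bucket count.
    # A real engine uses its own hash function; only determinism matters here.
    return key % num_buckets

def bucketize(rows, num_buckets):
    # Group (key, value) rows by bucket number.
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets

def bucketed_join(left, right, num_buckets):
    # Both sides are bucketed on the join key with the same bucket count,
    # so bucket i of the left side only needs bucket i of the right side.
    left_buckets = bucketize(left, num_buckets)
    right_buckets = bucketize(right, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Build a small hash index for this bucket of the right side only.
        index = defaultdict(list)
        for key, value in right_buckets.get(b, []):
            index[key].append(value)
        for key, value in left_buckets.get(b, []):
            for right_value in index.get(key, []):
                joined.append((key, value, right_value))
    return joined

# Made-up sample data for illustration.
orders = [(1, "order-a"), (2, "order-b"), (5, "order-c")]
users = [(1, "alice"), (5, "bob"), (7, "carol")]
result = bucketed_join(orders, users, 4)
```

If the two tables used different bucket counts or different bucket columns, equal keys could land in different bucket numbers, and the per-bucket join above would miss matches. That is why the condition "same column with same number of buckets" matters.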

However, since this feature was introduced, in my personal view it has not been widely used in the community, and it creates maintenance overhead for developers (for every new Pull Request, all bucket-related test cases need to be fixed).

Now that carbon has integrated with Spark's standard partition, developers can add bucket support using Spark's bucketed table feature in the future if it is required.

So, I propose to remove the bucket feature after CarbonData 1.3.0.
What do you think?

Regards,
Jacky

Re: About bucket feature in carbon

Posted by Ravindra Pesala <ra...@gmail.com>.

Yes Jacky, we will refactor it and use the partition flow.

-- 
Thanks & Regards,
Ravi

Re: About bucket feature in carbon

Posted by Jacky Li <13...@qq.com>.

Hi Ravindra,

You mean we can do one round of refactoring for the bucketed table feature in CarbonData 1.4?
I am fine with it.

Regards,
Jacky


Re: About bucket feature in carbon

Posted by Ravindra Pesala <ra...@gmail.com>.

Hi Likun,

I feel it is better to change the implementation to use Spark's bucketing
generation, just as standard Hive partitions are generated. It will be
easy to change once the partition feature is implemented. It is also a
useful feature for joining big tables, since hash-based buckets and
clustered by make such queries faster. So it is better to change the
implementation instead of removing it.

Regards,
Ravindra.

-- 
Thanks & Regards,
Ravi