Posted to user@hive.apache.org by Rahul Channe <dr...@googlemail.com> on 2016/08/19 20:46:48 UTC

Inserting data in hive bucket

Hi,

Is there a way to create a bucket with a specific file name in Hive? I see
that Hive generates its own names for the files, starting with 0000.

I want the file names to be in yyyymmdd format, since I am partitioning by
month and bucketing by date.

Thank you

Rahul

Re: Inserting data in hive bucket

Posted by Rahul Channe <dr...@googlemail.com>.
Thank you for the responses.


Re: Inserting data in hive bucket

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi Rahul,

I don't believe you can drop a particular bucket in Hive.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




Re: Inserting data in hive bucket

Posted by Rahul Channe <dr...@googlemail.com>.
Hi Mich,

I want to know if we can drop the data of a particular bucket in Hive.


Re: Inserting data in hive bucket

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hash partitioning (bucketing) does not make much sense for YYYY/MM/DD/32, as
pointed out.

With (mod 32), there are at most 32 offsets, in the range 0-31. With
YYYY/MM/DD you have to account for hash collisions as well. The set of
inputs is potentially large (and not known until we have encountered them
all), and if you want to spread them evenly (which is, after all, what hash
partitioning is about), then I think day of the month makes more sense.

HTH



Dr Mich Talebzadeh







Re: Inserting data in hive bucket

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> We are bucketing by date so we will have max 32 buckets

If you do want to look up specifically by date, you could just create day
partitions and never partition by month.

FYI, in a modern version of Hive

select count(1) from table where YEAR(dt) = 2016 and MONTH(dt) = 12

does prune it on the client side.

On a different note, 31 buckets is a bad idea (32 is ok), because for
String hashes 31 (32-1) is the magic number, which hurts "yyyymmdd" strings:
50% of your buckets end up with no data.

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6
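
To see why the bucket count interacts with string hashing, here is a minimal Python sketch. It emulates Java's String.hashCode(), which uses 31 as its multiplier, and then applies a non-negative modulo. That Hive buckets string columns exactly this way is an assumption made for illustration, and java_string_hash, bucket_for and buckets_used are hypothetical helper names, not Hive APIs. The script prints how many buckets actually receive data for one month of "yyyymmdd" strings, so the spread can be inspected rather than taken on trust.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode(): h = 31*h + ch, as a signed 32-bit int.

    Only meant for ASCII input such as "20160816".
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(s: str, num_buckets: int) -> int:
    """ASSUMED bucket rule: clear the sign bit, then take the remainder."""
    return (java_string_hash(s) & 0x7FFFFFFF) % num_buckets


def buckets_used(num_buckets: int, year: int = 2016, month: int = 8,
                 days: int = 31) -> set:
    """Distinct buckets hit by every 'yyyymmdd' string of one month."""
    return {
        bucket_for(f"{year:04d}{month:02d}{day:02d}", num_buckets)
        for day in range(1, days + 1)
    }


if __name__ == "__main__":
    for n in (31, 32):
        used = buckets_used(n)
        print(f"{n} buckets: {len(used)} non-empty -> {sorted(used)}")
```

Running it for other months or bucket counts only needs different arguments to buckets_used.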


If you store yyyymmdd as a number instead, you'll get the same number back
as the hashcode, so the bucket for a given day of the month won't be stable
as months change (20160816 % 32 == 16 and 20160716 % 32 == 12).
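
The arithmetic can be checked with a few lines of Python. This sketch assumes that an int column hashes to its own value and that the bucket is that value modulo the bucket count. int_bucket is a hypothetical helper, not a Hive function.

```python
NUM_BUCKETS = 32


def int_bucket(value: int, num_buckets: int = NUM_BUCKETS) -> int:
    """ASSUMED rule for an int column: the value is its own hash."""
    return value % num_buckets


if __name__ == "__main__":
    # The 16th of two consecutive months lands in two different buckets.
    for date_as_int in (20160716, 20160816):
        print(date_as_int, "-> bucket", int_bucket(date_as_int))
    # prints:
    # 20160716 -> bucket 12
    # 20160816 -> bucket 16
```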

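The only way to have buckets correspond to a day is to store day_of_month
as an int and bucket on it with 32 buckets; then bucket1 = 1, bucket2 = 2,
and so on up to bucket31 = 31, with bucket0 left empty (31 would land in
bucket0 only with 31 buckets).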
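
As a rough check of the day-to-bucket mapping, the sketch below assumes plain modulo bucketing on an int day_of_month column. day_bucket and layout are hypothetical names used only for illustration. Under that assumption, with 32 buckets every day keeps its own number as the bucket index and bucket 0 receives nothing. With 31 buckets, day 31 wraps around into bucket 0.

```python
def day_bucket(day_of_month: int, num_buckets: int) -> int:
    """ASSUMED rule: an int hashes to itself, bucket = value mod N."""
    return day_of_month % num_buckets


def layout(num_buckets: int) -> dict:
    """Map each bucket index to the days of the month that land in it."""
    result = {bucket: [] for bucket in range(num_buckets)}
    for day in range(1, 32):
        result[day_bucket(day, num_buckets)].append(day)
    return result


if __name__ == "__main__":
    for n in (31, 32):
        empty = [b for b, days in layout(n).items() if not days]
        print(f"{n} buckets: day 31 -> bucket {day_bucket(31, n)}, "
              f"empty buckets: {empty}")
```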

Cheers,
Gopal



Re: Inserting data in hive bucket

Posted by Rahul Channe <dr...@googlemail.com>.
Hi Mich,

We are bucketing by date, so we will have a maximum of 32 buckets.


Re: Inserting data in hive bucket

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Are you partitioning by month and bucketing by date or by day?

If that is the case, you only have 30-31 hash partitions (buckets) for
each month?

HTH

Dr Mich Talebzadeh





