Posted to user@hive.apache.org by Jeremy Chow <co...@gmail.com> on 2009/01/08 09:21:33 UTC

what is term "bucket" means?

Hi list,

I came across the term "bucket" while reading the Hive source code. What does it mean?

Thanks,
Jeremy
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com

Re: what is term "bucket" means?

Posted by Jeremy Chow <co...@gmail.com>.

I've got it, thank you!


Re: what is term "bucket" means?

Posted by Jeremy Chow <co...@gmail.com>.

Does that mean the same thing as a hash partition?


Re: what is term "bucket" means?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.

Hey Jeremy,

Hive stores each "table" inside of HDFS in a folder. For example, all of
your weblogs could be stored in a folder called "/hive/weblogs". If you want
to partition those weblogs by day, you can use the PARTITIONED BY clause on
the CREATE TABLE statement to create a subfolder for each new day, e.g.
"/hive/weblogs/ds=2009-01-08". If you wanted to further partition a day's
logfiles by userid, for example, Hive can hash partition your logfiles into
"buckets" (subfolders) inside that day's folder, e.g.
"/hive/weblogs/ds=2009-01-08/0001", where 0001 is the name of the bucket. To
indicate your desire to have buckets, use the CLUSTERED BY clause on the
CREATE TABLE statement (see
http://wiki.apache.org/hadoop/Hive/HiveQL#head-6fb42f2747383d4375e56cc31bbae68860c88a3d
).
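
As a rough illustration of the layout described above, here is a minimal Python sketch of how a row could be mapped to a partition folder and then to a bucket. The table schema, the bucket count of 32, the "userid modulo bucket count" hash, and the four-digit bucket naming are all assumptions made for the example. It is a sketch of the idea, not Hive's actual code.

```python
# Illustrative HiveQL for the table being modeled (column names and the
# bucket count are assumptions, not taken from a real schema):
#
#   CREATE TABLE weblogs (userid INT, url STRING)
#   PARTITIONED BY (ds STRING)
#   CLUSTERED BY (userid) INTO 32 BUCKETS;

NUM_BUCKETS = 32  # assumed; matches INTO 32 BUCKETS in the comment above


def bucket_index(userid: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Hash-partition an integer userid into the range [0, num_buckets).

    Assumes the hash of an integer is the integer itself; Hive's real
    hash function may differ.
    """
    return userid % num_buckets


def bucket_path(table_dir: str, ds: str, userid: int,
                num_buckets: int = NUM_BUCKETS) -> str:
    """Return the hypothetical HDFS location for a row."""
    idx = bucket_index(userid, num_buckets)
    # Four-digit zero padding mirrors the "0001" in the example path;
    # the real naming scheme is an assumption here.
    return f"{table_dir}/ds={ds}/{idx:04d}"


if __name__ == "__main__":
    print(bucket_path("/hive/weblogs", "2009-01-08", 33))
```

Every row with the same userid lands in the same bucket for a given day, which is what makes the bucket a usable unit for sampling.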

You can also use buckets with the TABLESAMPLE operator to run Hive queries
over subsets of your data; this is useful for rapidly prototyping new
analyses. See
http://wiki.apache.org/hadoop/Hive/HiveQL#head-c7c5e4391816048d290eb70091487b4f91beebc9
for the TABLESAMPLE syntax.
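
To make the sampling idea concrete, the sketch below mimics in plain Python what a query like the one in the comment asks for: only the rows whose userid hashes into one chosen bucket. The table name, column names, and the modulo stand-in for the hash function are assumptions for illustration, so treat it as a model of the concept and not as Hive's implementation.

```python
# Illustrative HiveQL (table and column names are assumptions):
#
#   SELECT * FROM weblogs TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid) s
#   WHERE s.ds = '2009-01-08';

from typing import Dict, Iterable, List


def sample_bucket(rows: Iterable[Dict], bucket: int, out_of: int,
                  column: str = "userid") -> List[Dict]:
    """Keep rows that fall into the 1-based `bucket` of `out_of` buckets.

    Uses value % out_of as a stand-in hash for integer columns.
    """
    if not 1 <= bucket <= out_of:
        raise ValueError("bucket must be between 1 and out_of")
    return [r for r in rows if r[column] % out_of == bucket - 1]


if __name__ == "__main__":
    rows = [{"userid": u, "url": f"/page/{u}"} for u in range(100)]
    sample = sample_bucket(rows, bucket=1, out_of=32)
    print([r["userid"] for r in sample])
```

With 32 buckets, a single bucket holds roughly 1/32 of the distinct userids, so a prototype query touches only a small slice of the day's data.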

Hive folks: in case I butchered that, feel free to jump in with a more
correct explanation. If it's correct, I'll toss it on the wiki. It would be
good to have actual HiveQL statements using buckets in the getting started
guide too, I'd imagine.

Later,
Jeff
