Posted to user@hive.apache.org by Lefty Leverenz <le...@gmail.com> on 2014/07/02 07:25:49 UTC

Re: Skewed vs ListBucketing

Does anyone have time to answer this?  It would be good to clarify things
in the wiki.

HIVE-3649 <https://issues.apache.org/jira/browse/HIVE-3649> added the list
bucketing feature in release 0.10.0.  The description says:

> We need to differ normal skewed table from list bucketing table. we use an
> optional parameter "store as DIRECTORIES"


So I think your understanding is correct, but let's hear from the experts.

-- Lefty


On Fri, Jun 27, 2014 at 1:25 PM, Steven Willis <sw...@compete.com> wrote:

> I'm having trouble understanding the difference between a skewed table and
> a list bucketed table:
>
> https://cwiki.apache.org/confluence/display/Hive/ListBucketing
>
> Is the only difference that ListBucketing stores the data as directories
> and a "plain" skewed table stores it as files? I think that's what the
> wiki page is saying, but it's very confusing. For one, the title of the
> page is ListBucketing and in many places it seems to use the phrase "List
> Bucketing" as the general feature of partitioning a table by skewed columns
> (whether in directories or files).
>
> There's a section "Skewed Table vs. List Bucketing Table" (
> https://cwiki.apache.org/confluence/display/Hive/ListBucketing#ListBucketing-ListBucketing) that
> I would assume would spell out the differences between the two, but it says:
>
>  - Skewed Table is a table which has skewed information.
>  - List Bucketing Table is a skewed table. In addition, it tells Hive to
> use the list bucketing feature on the skewed table: create sub-directories
> for skewed values.
>
> That makes it seem like "the list bucketing feature" is just using
> sub-directories for the data. If that's the case, why is the whole article
> titled ListBucketing, and why is the section describing the basic idea
> (that apparently both skewed tables and list bucketed tables have in
> common) titled just "List Bucketing" (
> https://cwiki.apache.org/confluence/display/Hive/ListBucketing#ListBucketing-ListBucketing
> ).
>
> The article also says, "Mainly due to its sub-directory nature, list
> bucketing can't coexist with some features." So does that mean just list
> bucketing (the subdirectory feature that skewed tables can have as an
> option) is incompatible with the features mentioned, or does it mean that
> any skewed table is incompatible with said features?
>
> -Steve
>

Re: Skewed vs ListBucketing

Posted by Lefty Leverenz <le...@gmail.com>.
Well, it turns out I dropped the ball on improving the list bucketing docs
back in April.  See the message thread "Skewed Tables"
<http://mail-archives.apache.org/mod_mbox/hive-user/201404.mbox/%3c4B94C3FD-B6D2-4844-8CEB-7C992A2261F6@hortonworks.com%3e>
that Mayur Gupta started on April 21 and Prasanth Jayachandran left in my
hands on April 28:

> There are two different optimizations that use "SKEWED BY" keyword. One is
> skewed join optimization and other is list bucketing optimization. I think
> we need to mention this in some place so that users are aware of the
> difference between the two. "STORED AS DIRECTORIES" is used by only one
> optimization i.e list bucketing.
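
For illustration only, here is a minimal HiveQL sketch of the distinction
described in that quote, assuming the SKEWED BY ... ON ... [STORED AS
DIRECTORIES] syntax from the Skewed Tables section of the DDL wikidoc. The
table names, column names, and skewed values are hypothetical.

```sql
-- Plain skewed table (hypothetical names): records only that the listed
-- values of `key` are heavily skewed. No STORED AS DIRECTORIES clause.
CREATE TABLE skewed_plain (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6');

-- List bucketing table: the same skew information, plus the optional
-- STORED AS DIRECTORIES clause, which asks Hive to create
-- sub-directories for the skewed values.
CREATE TABLE skewed_list_bucketed (key STRING, value STRING)
  SKEWED BY (key) ON ('1', '5', '6')
  STORED AS DIRECTORIES;
```

The only difference between the two statements is the final clause, which
matches the HIVE-3649 description of "store as DIRECTORIES" as the optional
parameter that tells a list bucketing table apart from a normal skewed table.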


I'm open to suggestions about how to improve the doc, or I'll tackle it as
best I can with the given information.

-- Lefty



Re: Skewed vs ListBucketing

Posted by Lefty Leverenz <le...@gmail.com>.
The Skewed Tables
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-SkewedTables>
section in the DDL wikidoc has more information which might be helpful.

HIVE-3649 was just one of several jiras that added list bucketing in
releases 0.10 and 0.11.  See HIVE-3026
<https://issues.apache.org/jira/browse/HIVE-3026> for links to the rest of
them.  (The one that added DML support hasn't been documented yet:
HIVE-3073 <https://issues.apache.org/jira/browse/HIVE-3073>.)

I'm revising the jira links in the wiki now.

-- Lefty

