Posted to user@hive.apache.org by Jeff Zhang <zj...@gmail.com> on 2010/02/25 10:01:16 UTC

Why no two aggregations can have different DISTINCT columns ?

Hi all,

I read the Hive tutorial, and it says that "no two aggregations can have
different DISTINCT columns". Could anyone explain the reason? Will the
DISTINCT aggregations in the following query be translated into a map-reduce
job, or are they evaluated locally?

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
    FROM pv_users
    GROUP BY pv_users.gender;


-- 
Best Regards

Jeff Zhang

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Zheng Shao <zs...@gmail.com>.

This query will fail with a compilation error.
The reason is that we use the sort phase in the reducers to make sure we
can detect duplicate values.
We can only sort the table one way: by one DISTINCT column or by the other,
not by both at once.

See https://issues.apache.org/jira/browse/HIVE-537 and
https://issues.apache.org/jira/browse/HIVE-474 for details.
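
To make the sorting argument concrete, here is a minimal Python sketch. It is
not Hive code: the sample rows and the helper count_distinct_sorted are made up
for illustration, and it only assumes that a reducer sees its rows sorted by
(group key, DISTINCT column) and compares each value with the previous one.

```python
from itertools import groupby

def count_distinct_sorted(rows):
    """Count distinct values per key from (key, value) pairs.

    Only the previous value is remembered, so the result is correct
    only if the pairs arrive sorted by key and then by value.
    """
    counts = {}
    for key, group in groupby(rows, key=lambda kv: kv[0]):
        distinct = 0
        previous = object()  # sentinel that equals no real value
        for _, value in group:
            if value != previous:
                distinct += 1
                previous = value
        counts[key] = distinct
    return counts

# Hypothetical pv_users rows: (gender, userid, ip)
pv_users = [
    ("F", "u1", "10.0.0.2"),
    ("F", "u1", "10.0.0.1"),
    ("F", "u2", "10.0.0.2"),
    ("M", "u3", "10.0.0.3"),
    ("M", "u3", "10.0.0.3"),
]

# One sort order: (gender, userid). Counting distinct userid works.
by_userid = sorted(pv_users, key=lambda r: (r[0], r[1]))
userid_counts = count_distinct_sorted((g, u) for g, u, _ in by_userid)

# In that same stream the ip values are not sorted, so comparing
# neighbours miscounts them.
ip_counts_wrong = count_distinct_sorted((g, ip) for g, _, ip in by_userid)

# A correct ip count needs a second sort, by (gender, ip).
by_ip = sorted(pv_users, key=lambda r: (r[0], r[2]))
ip_counts = count_distinct_sorted((g, ip) for g, _, ip in by_ip)

print(userid_counts, ip_counts_wrong, ip_counts)
```

With a single sort order only one of the two DISTINCT columns can be counted by
comparing neighbours; the other needs either a second sort or extra state per
group.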

Zheng


-- 
Yours,
Zheng

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Amr Awadallah <aa...@cloudera.com>.

+1, please post jira/patch.

-- amr

On 2/25/2010 1:20 AM, Zheng Shao wrote:
> Yes, definitely. Do you want to open a JIRA and post a patch?
> Please link the new JIRA to the other two JIRAs that were mentioned in
> this email thread.
>
> Zheng

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Mafish Liu <ma...@gmail.com>.

Patch uploaded.
Please review it at https://issues.apache.org/jira/browse/HIVE-474




-- 
Mafish@gmail.com

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Mafish Liu <ma...@gmail.com>.

2010/2/25 Todd Lipcon <to...@cloudera.com>:
> I think you can use this existing JIRA:
> http://issues.apache.org/jira/browse/HIVE-474
I'm using this JIRA. Thanks.




-- 
Mafish@gmail.com

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Todd Lipcon <to...@cloudera.com>.

I think you can use this existing JIRA:

http://issues.apache.org/jira/browse/HIVE-474

Thanks
-Todd


Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Mafish Liu <ma...@gmail.com>.

2010/2/25 Zheng Shao <zs...@gmail.com>:
> Yes, definitely. Do you want to open a JIRA and post a patch?
> Please link the new JIRA to the other two JIRAs that were mentioned in
> this email thread.
I'll open a JIRA.
The patch will be posted once the code and documents are in order.




-- 
Mafish@gmail.com

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Zheng Shao <zs...@gmail.com>.

Yes, definitely. Do you want to open a JIRA and post a patch?
Please link the new JIRA to the other two JIRAs that were mentioned in
this email thread.

Zheng




-- 
Yours,
Zheng

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Mafish Liu <ma...@gmail.com>.

Here are our results for multi-distinct:

hive> describe classes;
OK
name    string
number  string
class   string
Time taken: 0.122 seconds
hive> select * from classes;
OK
1       11      8
2       22      12
4       212     2
5       232     23
6       22      2
7       22      2
3       333     13
3       33      3
4       133     32
5       33      3
Time taken: 0.154 seconds

hive> select count(distinct name), count(distinct number), class from classes group by class;
....
1       1       12
1       1       13
3       2       2
1       1       23
2       1       3
1       1       32
1       1       8
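
The counts above can be cross-checked outside Hive. The following is a minimal
Python sketch, not part of the patch, that assumes the ten rows printed by
"select * from classes" are the whole table and simply collects the distinct
names and numbers of each class in sets. The variable names are made up for
illustration.

```python
from collections import defaultdict

# Rows copied from the `select * from classes` output: (name, number, class).
# All three columns are strings, as shown by `describe classes`.
rows = [
    ("1", "11", "8"),
    ("2", "22", "12"),
    ("4", "212", "2"),
    ("5", "232", "23"),
    ("6", "22", "2"),
    ("7", "22", "2"),
    ("3", "333", "13"),
    ("3", "33", "3"),
    ("4", "133", "32"),
    ("5", "33", "3"),
]

names = defaultdict(set)    # class -> distinct names
numbers = defaultdict(set)  # class -> distinct numbers
for name, number, cls in rows:
    names[cls].add(name)
    numbers[cls].add(number)

# Order the groups by class compared as a string ("12" sorts before "2").
result = [(len(names[c]), len(numbers[c]), c) for c in sorted(names)]
for row in result:
    print(*row, sep="\t")
```

This yields the same seven rows as the Hive output. Ordering the groups by
class as a string happens to reproduce the row order shown; that ordering is an
assumption of the sketch, since the query itself has no ORDER BY.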





-- 
Mafish@gmail.com

Re: Why no two aggregations can have different DISTINCT columns ?

Posted by Mafish Liu <ma...@gmail.com>.

Hive does not support multi-distinct in one query.

We have implemented multi-distinct on top of Hive 0.4.2rc to meet our own needs.
We don't know whether the Hive project is interested in this feature.




-- 
Mafish@gmail.com