Posted to user@hive.apache.org by Jeff Zhang <zj...@gmail.com> on 2010/02/25 10:01:16 UTC
Why no two aggregations can have different DISTINCT columns ?
Hi all,
I read the Hive tutorial, and it says that "no two aggregations can have
different DISTINCT columns". Could anyone explain the reason? Will the
following DISTINCT query be translated into a map-reduce job, or is it
just run locally?
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid),
count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;
--
Best Regards
Jeff Zhang
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Zheng Shao <zs...@gmail.com>.
This query will fail with a compilation error.
The reason is that we use the sort phase in the reducers to make sure we
can detect duplicate values.
We can only sort the rows one way or the other, not both ways at once.
See https://issues.apache.org/jira/browse/HIVE-537 and
https://issues.apache.org/jira/browse/HIVE-474 for details.
Zheng
--
Yours,
Zheng
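Zheng's point about the reducer sort phase can be illustrated with a small sketch. This is not Hive's code; it is a plain Python illustration, and the function name and the sample rows are made up for this example. Duplicates are detected by comparing each value with its neighbour, which only works when the rows are sorted on the column being counted, so the two DISTINCT columns each need their own sort order.

```python
from itertools import groupby


def count_distinct_sorted(rows, group_idx, distinct_idx):
    """Count distinct values per group by scanning rows sorted on (group, value)."""
    # Sorting on (group key, distinct column) puts equal values next to each other.
    ordered = sorted(rows, key=lambda r: (r[group_idx], r[distinct_idx]))
    counts = {}
    for group, members in groupby(ordered, key=lambda r: r[group_idx]):
        previous = object()  # sentinel that equals no real value
        n = 0
        for row in members:
            value = row[distinct_idx]
            if value != previous:  # a new value starts when it differs from its neighbour
                n += 1
                previous = value
        counts[group] = n
    return counts


# Hypothetical (gender, userid, ip) rows, not data from the thread.
rows = [
    ("F", "u1", "10.0.0.1"),
    ("F", "u2", "10.0.0.2"),
    ("F", "u3", "10.0.0.1"),
    ("M", "u4", "10.0.0.3"),
    ("M", "u4", "10.0.0.3"),
]

users_per_gender = count_distinct_sorted(rows, 0, 1)  # sorted by (gender, userid)
ips_per_gender = count_distinct_sorted(rows, 0, 2)    # needs a second sort, by (gender, ip)
print(users_per_gender, ips_per_gender)
```

In the (gender, userid) order the two "10.0.0.1" rows for "F" are separated by "10.0.0.2", so a neighbour comparison on ip would overcount unless the rows are sorted again by ip.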
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Amr Awadallah <aa...@cloudera.com>.
+1, please post jira/patch.
-- amr
On 2/25/2010 1:20 AM, Zheng Shao wrote:
> Yes definitely. Do you want to open a JIRA and post a patch?
> Please link the new JIRA to the other two JIRAs that were mentioned in
> this email thread.
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Mafish Liu <ma...@gmail.com>.
Patch uploaded.
Please review it at https://issues.apache.org/jira/browse/HIVE-474
2010/2/26 Mafish Liu <ma...@gmail.com>:
> 2010/2/25 Todd Lipcon <to...@cloudera.com>:
>> I think you can use this existing JIRA:
>> http://issues.apache.org/jira/browse/HIVE-474
> I'm using this JIRA. Thanks.
--
Mafish@gmail.com
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Mafish Liu <ma...@gmail.com>.
2010/2/25 Todd Lipcon <to...@cloudera.com>:
> I think you can use this existing JIRA:
> http://issues.apache.org/jira/browse/HIVE-474
I'm using this JIRA. Thanks.
--
Mafish@gmail.com
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Todd Lipcon <to...@cloudera.com>.
I think you can use this existing JIRA:
http://issues.apache.org/jira/browse/HIVE-474
Thanks
-Todd
On Thu, Feb 25, 2010 at 2:11 AM, Mafish Liu <ma...@gmail.com> wrote:
> 2010/2/25 Zheng Shao <zs...@gmail.com>:
> > Yes definitely. Do you want to open a JIRA and post a patch?
> > Please link the new JIRA to the other two JIRAs that were mentioned in
> > this email thread.
> I'll open a JIRA.
> The patch will be posted once the code and documents are in order.
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Mafish Liu <ma...@gmail.com>.
2010/2/25 Zheng Shao <zs...@gmail.com>:
> Yes definitely. Do you want to open a JIRA and post a patch?
> Please link the new JIRA to the other two JIRAs that were mentioned in
> this email thread.
I'll open a JIRA.
The patch will be posted once the code and documents are in order.
--
Mafish@gmail.com
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Zheng Shao <zs...@gmail.com>.
Yes definitely. Do you want to open a JIRA and post a patch?
Please link the new JIRA to the other two JIRAs that were mentioned in
this email thread.
Zheng
On Thu, Feb 25, 2010 at 1:16 AM, Mafish Liu <ma...@gmail.com> wrote:
> Hive does not support multi-distinct in one query.
>
> We have implemented multi-distinct on top of Hive 0.4.2rc for our own needs.
> We don't know whether the Hive project is interested in this feature.
--
Yours,
Zheng
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Mafish Liu <ma...@gmail.com>.
Here are our results for multi-distinct:
hive> describe classes;
OK
name string
number string
class string
Time taken: 0.122 seconds
hive> select * from classes;
OK
1 11 8
2 22 12
4 212 2
5 232 23
6 22 2
7 22 2
3 333 13
3 33 3
4 133 32
5 33 3
Time taken: 0.154 seconds
hive> select count(distinct name), count(distinct number), class from
classes group by class;
....
1 1 12
1 1 13
3 2 2
1 1 23
2 1 3
1 1 32
1 1 8
--
Mafish@gmail.com
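The counts posted above can be cross-checked outside Hive. The sketch below loads the same ten rows into an in-memory SQLite database through Python's sqlite3 module and runs the same multi-distinct query. SQLite is only a stand-in here, not Hive or the patched build, and the ORDER BY clause is an addition so that the row order is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# All three columns are strings, as in the "describe classes" output.
conn.execute("CREATE TABLE classes (name TEXT, number TEXT, class TEXT)")
rows = [
    ("1", "11", "8"), ("2", "22", "12"), ("4", "212", "2"), ("5", "232", "23"),
    ("6", "22", "2"), ("7", "22", "2"), ("3", "333", "13"), ("3", "33", "3"),
    ("4", "133", "32"), ("5", "33", "3"),
]
conn.executemany("INSERT INTO classes VALUES (?, ?, ?)", rows)

# Same query as in the post; ORDER BY is added for a stable order.
result = conn.execute(
    "SELECT count(DISTINCT name), count(DISTINCT number), class "
    "FROM classes GROUP BY class ORDER BY class"
).fetchall()
for row in result:
    print(*row)
```

Because class is a string column, the groups sort as text ("12", "13", "2", ...), which matches the order of the rows in the posted output.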
Re: Why no two aggregations can have different DISTINCT columns ?
Posted by Mafish Liu <ma...@gmail.com>.
Hive does not support multi-distinct in one query.
We have implemented multi-distinct on top of Hive 0.4.2rc for our own needs.
We don't know whether the Hive project is interested in this feature.
--
Mafish@gmail.com
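Without the multi-distinct patch, one generic SQL rewrite of the query from the original post is to compute each DISTINCT count in its own subquery and join the results on gender. The sketch below runs that rewrite against SQLite from Python only to show that the shape of the query is valid SQL. It has not been run on Hive, whether a given Hive release accepts it is an assumption, and the aliases u, i, user_cnt and ip_cnt as well as the sample rows are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pv_users (gender TEXT, userid TEXT, ip TEXT)")
# Hypothetical sample rows, not data from the thread.
conn.executemany(
    "INSERT INTO pv_users VALUES (?, ?, ?)",
    [
        ("F", "u1", "10.0.0.1"),
        ("F", "u2", "10.0.0.2"),
        ("F", "u3", "10.0.0.1"),
        ("M", "u4", "10.0.0.3"),
        ("M", "u4", "10.0.0.3"),
    ],
)

# Each subquery contains exactly one DISTINCT column; the join stitches
# the two per-gender counts back together.
rewritten = """
SELECT u.gender, u.user_cnt, i.ip_cnt
FROM (SELECT gender, count(DISTINCT userid) AS user_cnt
      FROM pv_users GROUP BY gender) u
JOIN (SELECT gender, count(DISTINCT ip) AS ip_cnt
      FROM pv_users GROUP BY gender) i
  ON u.gender = i.gender
ORDER BY u.gender
"""
result = conn.execute(rewritten).fetchall()
for row in result:
    print(*row)
```

The cost of this shape is that pv_users is scanned and aggregated twice. Rows whose gender is NULL would also drop out of the join, which the single-query form would keep as their own group.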