Posted to user@kylin.apache.org by 崔苗 <cu...@danale.com> on 2017/11/30 01:25:51 UTC

count distinct

Hi,
We want to get count(distinct user) grouped by hour/day/week/month/year, and we have a question:
What does Kylin keep for count(distinct user): the set of distinct users, or just a count? If we want to count distinct users by year, do we need to keep a year of data in Hive?

Re: Re: Re: Re: count distinct

Posted by ShaoFeng Shi <sh...@apache.org>.
Correct. GlobalDictionary can only encode a non-integer value to an
integer; it cannot decode an integer back to the original value.
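
To make the one-way encoding concrete, here is a minimal sketch in Python. It is
not Kylin's implementation: ForwardOnlyDictionary is a hypothetical toy class,
and a plain Python int stands in for the bitmap. It only illustrates why a
bitmap built from dictionary-encoded IDs can give a distinct count but not the
original strings.

```python
class ForwardOnlyDictionary:
    """Hypothetical toy stand-in for a global dictionary: string -> int only."""

    def __init__(self):
        self._ids = {}

    def encode(self, value: str) -> int:
        # Assign the next free integer the first time a value is seen.
        if value not in self._ids:
            self._ids[value] = len(self._ids)
        return self._ids[value]


dictionary = ForwardOnlyDictionary()
bitmap = 0  # bit i is set when the user encoded as integer i was seen

for user_id in ["u-alice", "u-bob", "u-alice"]:
    bitmap |= 1 << dictionary.encode(user_id)

# The count is available directly from the bitmap.
distinct_count = bin(bitmap).count("1")

# The set bits are only integers; with no decode step they cannot be
# turned back into "u-alice" or "u-bob".
set_bits = [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]
```

If the user ID were already an integer, the set bits would be the IDs
themselves and no dictionary would be involved.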

2017-12-08 16:16 GMT+08:00 崔苗 <cu...@danale.com>:

> 1. The user_id is a unique string ID, but right now we can't get the set
> of user_ids from Kylin, right?

-- 
Best regards,

Shaofeng Shi 史少锋

Re:Re: Re: Re: count distinct

Posted by 崔苗 <cu...@danale.com>.
1. The user_id is a unique string ID, but right now we can't get the set of user_ids from Kylin, right?

At 2017-12-07 09:57:31, ShaoFeng Shi <sh...@apache.org> wrote:
Hi Miao,

For 1, Kylin focuses on OLAP scenarios, so most queries are aggregate queries rather than detail queries. Your scenario is one that a bitmap can fit, though: if the result set isn't big, it is doable. Kylin would only need to extract the values from the bitmap (if the user ID is of an integer type, there is no need to decode them with a dictionary). This is similar to the TopN measure.

For 2, yes, the global dictionary will grow as the number of users grows.

For 3, if you use Kylin 2.1, the cube data and the metadata are all on the HBase cluster. Before Kylin 2.1, there was an issue that caused some metadata files to be left on the Hive cluster. Whatever the deployment topology, we suggest backing up the metadata periodically to minimize the risk of data loss.


Re: Re: Re: count distinct

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Miao,

For 1, Kylin focuses on OLAP scenarios, so most queries are aggregate
queries rather than detail queries. Your scenario is one that a bitmap can
fit, though: if the result set isn't big, it is doable. Kylin would only
need to extract the values from the bitmap (if the user ID is of an integer
type, there is no need to decode them with a dictionary). This is similar
to the TopN measure.

For 2, yes, the global dictionary will grow as the number of users grows.

For 3, if you use Kylin 2.1, the cube data and the metadata are all on the
HBase cluster. Before Kylin 2.1, there was an issue that caused some
metadata files to be left on the Hive cluster. Whatever the deployment
topology, we suggest backing up the metadata periodically to minimize the
risk of data loss.

2017-12-06 9:45 GMT+08:00 崔苗 <cu...@danale.com>:

> 1. We have four data nodes: us, shenzhen-china, hongkong-china and eu.
> Every data node has one MySQL database. We want to deploy four Kylin
> clusters to analyse the data and merge the results to get the final
> result, so we need the distinct user set from every data node, merged to
> get rid of duplicated users. It seems this is not a good scenario for
> Kylin.
> 2. If we want count distinct on a string column such as user ID, which is
> a high-cardinality column, how do we estimate the memory that the global
> dictionary needs? Will Kylin expand the global dictionary and the user
> bitmap if users increase every day?
> 3. If we deploy Kylin with a standalone HBase cluster, will all the result
> data, such as the dictionary and bitmap, be stored in the HBase cluster?
> If so, we don't need HA mode on the other Hadoop cluster (Hive + Spark),
> because data loss in that cluster will not damage the results; we just
> need to ensure high availability of the HBase cluster. Is that right?

-- 
Best regards,

Shaofeng Shi 史少锋

Re:Re: Re: count distinct

Posted by 崔苗 <cu...@danale.com>.
1. We have four data nodes: us, shenzhen-china, hongkong-china and eu. Every data node has one MySQL database. We want to deploy four Kylin clusters to analyse the data and merge the results to get the final result, so we need the distinct user set from every data node, merged to get rid of duplicated users. It seems this is not a good scenario for Kylin.
2. If we want count distinct on a string column such as user ID, which is a high-cardinality column, how do we estimate the memory that the global dictionary needs? Will Kylin expand the global dictionary and the user bitmap if users increase every day?
3. If we deploy Kylin with a standalone HBase cluster, will all the result data, such as the dictionary and bitmap, be stored in the HBase cluster? If so, we don't need HA mode on the other Hadoop cluster (Hive + Spark), because data loss in that cluster will not damage the results; we just need to ensure high availability of the HBase cluster. Is that right?
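
The merge problem in point 1 can be sketched in a few lines of Python. This is only an illustration with made-up user IDs, and local_encode is a hypothetical helper, not a Kylin function. It assumes each cluster would build its own dictionary independently, in which case the integer assigned to a user in one region need not match the integer assigned in another, so only the original string IDs are safe to merge.

```python
# Made-up sample data: user IDs seen in each of the four regions.
regions = {
    "us": ["u-alice", "u-bob"],
    "shenzhen-china": ["u-bob", "u-carol"],
    "hongkong-china": ["u-carol"],
    "eu": ["u-alice", "u-dave"],
}


def local_encode(users):
    """Hypothetical per-cluster dictionary: IDs assigned in arrival order."""
    ids = {}
    for user in users:
        ids.setdefault(user, len(ids))
    return ids


# Summing per-region distinct counts double-counts users seen in two regions.
per_region_counts = {name: len(set(users)) for name, users in regions.items()}
naive_total = sum(per_region_counts.values())

# Merging the actual ID sets removes the duplicates.
global_distinct = len(set().union(*regions.values()))

# Independently built dictionaries give the same user different integers,
# so integer bitmaps from different clusters are not directly comparable.
local_ids = {name: local_encode(users) for name, users in regions.items()}
```

Here "u-bob" is encoded as 1 in the us dictionary and as 0 in the shenzhen-china one, which is why the per-region results cannot simply be OR-ed together.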

At 2017-12-06 08:41:13, ShaoFeng Shi <sh...@apache.org> wrote:
Hi Miao,

1. Currently, Kylin only returns the count from the bitmap, not the IDs in it. It should be possible to extend this. Could you please describe your scenario?
2. Yes, the Cube API returns each segment of the cube, and each segment has a start date and an end date. Please check Kylin's REST API documentation.


Re: Re: count distinct

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Miao,

1. Currently, Kylin only returns the count from the bitmap, not the IDs in
it. It should be possible to extend this. Could you please describe your
scenario?
2. Yes, the Cube API returns each segment of the cube, and each segment
has a start date and an end date. Please check Kylin's REST API
documentation.
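
As a rough illustration of point 2, the Python sketch below reads segment
ranges out of a cube description. The JSON shape is an assumption made for
this example: the field names date_range_start and date_range_end are taken
from the question in this thread, the values are assumed to be epoch
milliseconds, and sample_response and segment_ranges are hypothetical names.
Check the REST API documentation for the actual endpoint and response format.

```python
import json
from datetime import datetime, timezone

# Hypothetical response body; a real response may be shaped differently.
sample_response = """
{
  "name": "user_cube",
  "segments": [
    {"name": "seg1",
     "date_range_start": 1511913600000,
     "date_range_end": 1512000000000},
    {"name": "seg2",
     "date_range_start": 1512000000000,
     "date_range_end": 1512086400000}
  ]
}
"""


def to_utc_date(epoch_millis):
    """Convert epoch milliseconds (assumed unit) to an ISO date in UTC."""
    moment = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return moment.date().isoformat()


def segment_ranges(cube_json):
    """Return (segment name, start date, end date) for each segment."""
    cube = json.loads(cube_json)
    return [
        (
            seg["name"],
            to_utc_date(seg["date_range_start"]),
            to_utc_date(seg["date_range_end"]),
        )
        for seg in cube.get("segments", [])
    ]


ranges = segment_ranges(sample_response)
```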

2017-12-05 18:31 GMT+08:00 崔苗 <cu...@danale.com>:

> 1. If there is a bitmap stored in HBase, can we get the distinct user set
> when we need to know all the distinct users?
> 2. Is there any RESTful API that returns the cube's date_time,
> date_range_start and date_range_end?

-- 
Best regards,

Shaofeng Shi 史少锋

Re:Re: count distinct

Posted by 崔苗 <cu...@danale.com>.
1. If there is a bitmap stored in HBase, can we get the distinct user set when we need to know all the distinct users?
2. Is there any RESTful API that returns the cube's date_time, date_range_start and date_range_end?

At 2017-11-30 18:30:27, ShaoFeng Shi <sh...@apache.org> wrote:
Hi Miao,

Kylin uses HyperLogLog or Bitmap to persist the distinct values. You can find some information in this blog post: https://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/


Re: count distinct

Posted by ShaoFeng Shi <sh...@apache.org>.
Hi Miao,

Kylin uses HyperLogLog or Bitmap to persist the distinct values. You can
find some information in this blog post:
https://kylin.apache.org/blog/2016/08/01/count-distinct-in-kylin/
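
A small Python sketch may help show why a persisted bitmap differs from a
bare count. It is not Kylin code: the dates and bit patterns are made up, and
a plain Python int stands in for each day's bitmap. Counts from separate days
cannot simply be added, while bitmaps can be OR-ed together and then counted.

```python
# Made-up per-day bitmaps; bit i is set when user number i was active.
daily_bitmaps = {
    "2017-11-28": 0b0111,  # users 0, 1, 2
    "2017-11-29": 0b0110,  # users 1, 2
    "2017-11-30": 0b1100,  # users 2, 3
}


def popcount(bitmap):
    """Number of set bits, i.e. the distinct count the bitmap represents."""
    return bin(bitmap).count("1")


# Per-day distinct counts, and what happens if they are naively added.
daily_counts = {day: popcount(bits) for day, bits in daily_bitmaps.items()}
sum_of_counts = sum(daily_counts.values())

# OR-ing the bitmaps first gives the distinct count over the whole period.
merged = 0
for bits in daily_bitmaps.values():
    merged |= bits
rolled_up_count = popcount(merged)
```

The naive sum counts users 1 and 2 more than once; the merged bitmap does not.
The same idea applies to rolling days up into weeks, months, or years.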

2017-11-30 9:25 GMT+08:00 崔苗 <cu...@danale.com>:

> Hi,
> We want to get count(distinct user) grouped by hour/day/week/month/year,
> and we have a question:
> What does Kylin keep for count(distinct user): the set of distinct users,
> or just a count? If we want to count distinct users by year, do we need
> to keep a year of data in Hive?


-- 
Best regards,

Shaofeng Shi 史少锋