Posted to user@kylin.apache.org by we...@corp.netease.com on 2019/10/28 09:31:05 UTC

need help with non-static dimensions

Hi all. It seems that Kylin treats the dimensions of the input data as
static. How can I use Kylin to process input data whose dimension values
may change?

 

For example, suppose the input data has the fields "t_date, user_id,
user_server, user_os". A user's login server may change during the day, so
to calculate the DAU per server and OS, I need to deduplicate like this:

 

SELECT
    t_date, user_server, user_os, COUNT(DISTINCT user_id) AS user_cnt
FROM
    (
    SELECT
        t_date, user_id, user_os,
        -- one user maps to only one server
        MAX(user_server) AS user_server
    FROM Src
    GROUP BY t_date, user_id, user_os
    )
GROUP BY t_date, user_server, user_os

 

This is because a user must not be counted under more than one server.

 

Suppose there are inputs:

 

t_date      user_id    user_server    user_os
20191028    Lily       100            Windows
20191028    Lily       101            Windows

 

The expected result is:

t_date      user_server    user_os    user_cnt
20191028    100            Windows    0
20191028    101            Windows    1

 

But the result from Kylin may be:

t_date      user_server    user_os    user_cnt
20191028    100            Windows    1
20191028    101            Windows    1

 

which is not what I expect. 
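
To make the difference concrete, here is a minimal sketch that runs both aggregations on the two sample rows. It uses Python's built-in sqlite3 module purely as a stand-in and does not involve Kylin at all. The alias per_user and the variable names are illustrative, and the single-level query is only an assumption about what a distinct count pre-aggregated per (t_date, user_server, user_os) would effectively compute.

```python
import sqlite3

# Sample rows from the post. SQLite is only a stand-in here, not Kylin.
rows = [
    ("20191028", "Lily", 100, "Windows"),
    ("20191028", "Lily", 101, "Windows"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Src (t_date TEXT, user_id TEXT, user_server INTEGER, user_os TEXT)"
)
conn.executemany("INSERT INTO Src VALUES (?, ?, ?, ?)", rows)

# Two-level query from the post: pin each user to one server, then count.
dedup_sql = """
SELECT t_date, user_server, user_os, COUNT(DISTINCT user_id) AS user_cnt
FROM (
    SELECT t_date, user_id, user_os, MAX(user_server) AS user_server
    FROM Src
    GROUP BY t_date, user_id, user_os
) AS per_user
GROUP BY t_date, user_server, user_os
ORDER BY user_server
"""

# Assumed single-level behaviour: distinct users counted per dimension cell.
naive_sql = """
SELECT t_date, user_server, user_os, COUNT(DISTINCT user_id) AS user_cnt
FROM Src
GROUP BY t_date, user_server, user_os
ORDER BY user_server
"""

dedup_result = conn.execute(dedup_sql).fetchall()
naive_result = conn.execute(naive_sql).fetchall()

print("dedup:", dedup_result)
print("naive:", naive_result)
```

In this sketch the two-level query returns a single row, for server 101 with a count of 1. Server 100 produces no row at all, which corresponds to the count of 0 in the expected result. The single-level query returns one row per server, each with a count of 1, matching the unwanted output.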

 

How should I handle input data with non-static dimensions?