You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Anup Tiwari <an...@gmail.com> on 2018/07/12 07:26:03 UTC

Hive Row number Use case

Hi All,

We have a use case where we want to assign a row number to a table based on
3 column ( uid, update_date, flag) i.e. if value of any of the column gets
changed, we want to reset this number. Please find below sample input data
and expected output data.

Also please note that we have tried row_number() over(partition by uid,
update_date, flag order by update_time asc)  but due to this actual input
ordering got break due to I believe partition by clause because it seems
partition by creates group within column specified and then it start row
number and due to this actual ordering is breaking. So i just wanted to
know that is there any function available in hive which can give us below
result OR we are missing something in window function?


*Input Data :- *

*uid* *update_date* *update_time* *flag*
468730 2017-07-12 12/07/2017 22:59:17 1
468730 2017-07-12 12/07/2017 23:02:14 0
468730 2017-07-12 12/07/2017 23:07:40 0
468730 2017-07-12 12/07/2017 23:12:41 0
468730 2017-07-12 12/07/2017 23:22:06 0
468730 2017-07-12 12/07/2017 23:38:35 0
468730 2017-07-12 12/07/2017 23:44:19 0
468730 2017-07-12 12/07/2017 23:47:49 1
468730 2017-07-12 12/07/2017 23:48:49 1
468730 2017-07-12 12/07/2017 23:53:31 0
468730 2017-07-12 12/07/2017 23:57:01 1
468730 2017-07-13 13/07/2017 00:03:10 1
468730 2017-07-13 13/07/2017 00:06:35 0
468730 2017-07-13 13/07/2017 00:07:29 1
468731 2017-07-13 12/07/2017 12:59:17 1
468731 2017-07-13 12/07/2017 13:02:14 0
468731 2017-07-13 12/07/2017 13:07:40 0
468731 2017-07-13 12/07/2017 13:12:41 0


*Output Data :-*

*uid* *update_date* *update_time* *flag* *required_row_num*
468730 2017-07-12 12/07/2017 22:59:17 1 1
468730 2017-07-12 12/07/2017 23:02:14 0 1
468730 2017-07-12 12/07/2017 23:07:40 0 2
468730 2017-07-12 12/07/2017 23:12:41 0 3
468730 2017-07-12 12/07/2017 23:22:06 0 4
468730 2017-07-12 12/07/2017 23:38:35 0 5
468730 2017-07-12 12/07/2017 23:44:19 0 6
468730 2017-07-12 12/07/2017 23:47:49 1 1
468730 2017-07-12 12/07/2017 23:48:49 1 2
468730 2017-07-12 12/07/2017 23:53:31 0 1
468730 2017-07-12 12/07/2017 23:57:01 1 1
468730 2017-07-13 13/07/2017 00:03:10 1 1
468730 2017-07-13 13/07/2017 00:06:35 0 1
468730 2017-07-13 13/07/2017 00:07:29 1 1
468731 2017-07-13 12/07/2017 12:59:17 1 1
468731 2017-07-13 12/07/2017 13:02:14 0 1
468731 2017-07-13 12/07/2017 13:07:40 0 2
468731 2017-07-13 12/07/2017 13:12:41 0 3
*FYI :* We are one Hive 2.3.1.

Re: Hive Row number Use case

Posted by Brad Kaneyama <br...@gmail.com>.
As you noticed your partition by clause is improper since your row_number()
counts all of the members that are the same -- instead of sequencing a run
of the same values.

Here's a POC using a simplified version of your dataset:
You can adjust the grouping windows and change detection accordingly.

CREATE TABLE foo
(
  dt string,
  flag int
);

insert into foo
values
('2017-07-12 22:59:17',1),
('2017-07-12 23:02:14',0),
('2017-07-12 23:07:40',0),
('2017-07-12 23:12:41',0),
('2017-07-12 23:22:06',0),
('2017-07-12 23:38:35',0),
('2017-07-12 23:44:19',0),
('2017-07-12 23:47:49',1),
('2017-07-12 23:48:49',1),
('2017-07-12 23:53:31',0),
('2017-07-12 23:57:01',1),
('2017-07-13 00:03:10',1),
('2017-07-13 00:06:35',0),
('2017-07-13 00:07:29',1),
('2017-07-12 12:59:17',1),
('2017-07-12 13:02:14',0),
('2017-07-12 13:07:40',0),
('2017-07-12 13:12:41',0);

with cte
AS
(
  select
    dt,
    flag,
    lag(flag) over(order by dt) AS prev_flag,
    (case
      when flag <> lag(flag,1) over(order by dt) then 1
      else 0
    end) as changed_indicator
  from
    foo as x
),
cte2
AS
(
  select
    *,
    sum(changed_indicator) over(order by dt) AS grouping_key
  from
    cte
)
select
  *,
  row_number() over(partition by grouping_key order by dt) as seq
from
  cte2
order by
  dt;


+---------------------+------+-----------+-------------------+--------------+-----+
|         dt          | flag | prev_flag | changed_indicator | grouping_key
| seq |
+---------------------+------+-----------+-------------------+--------------+-----+
| 2017-07-12 12:59:17 |    1 | NULL      |                 0 |            0
|   1 |
| 2017-07-12 13:02:14 |    0 | 1         |                 1 |            1
|   1 |
| 2017-07-12 13:07:40 |    0 | 0         |                 0 |            1
|   2 |
| 2017-07-12 13:12:41 |    0 | 0         |                 0 |            1
|   3 |
| 2017-07-12 22:59:17 |    1 | 0         |                 1 |            2
|   1 |
| 2017-07-12 23:02:14 |    0 | 1         |                 1 |            3
|   1 |
| 2017-07-12 23:07:40 |    0 | 0         |                 0 |            3
|   2 |
| 2017-07-12 23:12:41 |    0 | 0         |                 0 |            3
|   3 |
| 2017-07-12 23:22:06 |    0 | 0         |                 0 |            3
|   4 |
| 2017-07-12 23:38:35 |    0 | 0         |                 0 |            3
|   5 |
| 2017-07-12 23:44:19 |    0 | 0         |                 0 |            3
|   6 |
| 2017-07-12 23:47:49 |    1 | 0         |                 1 |            4
|   1 |
| 2017-07-12 23:48:49 |    1 | 1         |                 0 |            4
|   2 |
| 2017-07-12 23:53:31 |    0 | 1         |                 1 |            5
|   1 |
| 2017-07-12 23:57:01 |    1 | 0         |                 1 |            6
|   1 |
| 2017-07-13 00:03:10 |    1 | 1         |                 0 |            6
|   2 |
| 2017-07-13 00:06:35 |    0 | 1         |                 1 |            7
|   1 |
| 2017-07-13 00:07:29 |    1 | 0         |                 1 |            8
|   1 |
+---------------------+------+-----------+-------------------+--------------+-----+


On Tue, Jul 17, 2018 at 11:08 PM Anup Tiwari <an...@gmail.com> wrote:

> Can someone look into this and revert if possible?
>
>
> Regards,
> Anup Tiwari
>
> On Sat, Jul 14, 2018 at 12:28 AM, Anup Tiwari <an...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> Can someone look into this and revert if possible?
>>
>> Thanks.
>>
>>
>> On Thu, 12 Jul 2018 12:56 Anup Tiwari, <an...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> We have a use case where we want to assign a row number to a table based
>>> on 3 column ( uid, update_date, flag) i.e. if value of any of the
>>> column gets changed, we want to reset this number. Please find below sample
>>> input data and expected output data.
>>>
>>> Also please note that we have tried row_number() over(partition by uid,
>>> update_date, flag order by update_time asc)  but due to this actual
>>> input ordering got break due to I believe partition by clause because it
>>> seems partition by creates group within column specified and then it start
>>> row number and due to this actual ordering is breaking. So i just wanted to
>>> know that is there any function available in hive which can give us below
>>> result OR we are missing something in window function?
>>>
>>>
>>> *Input Data :- *
>>>
>>> *uid* *update_date* *update_time* *flag*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0
>>> 468730 2017-07-12 12/07/2017 23:07:40 0
>>> 468730 2017-07-12 12/07/2017 23:12:41 0
>>> 468730 2017-07-12 12/07/2017 23:22:06 0
>>> 468730 2017-07-12 12/07/2017 23:38:35 0
>>> 468730 2017-07-12 12/07/2017 23:44:19 0
>>> 468730 2017-07-12 12/07/2017 23:47:49 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1
>>> 468730 2017-07-12 12/07/2017 23:53:31 0
>>> 468730 2017-07-12 12/07/2017 23:57:01 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0
>>> 468730 2017-07-13 13/07/2017 00:07:29 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0
>>> 468731 2017-07-13 12/07/2017 13:07:40 0
>>> 468731 2017-07-13 12/07/2017 13:12:41 0
>>>
>>>
>>> *Output Data :-*
>>>
>>> *uid* *update_date* *update_time* *flag* *required_row_num*
>>> 468730 2017-07-12 12/07/2017 22:59:17 1 1
>>> 468730 2017-07-12 12/07/2017 23:02:14 0 1
>>> 468730 2017-07-12 12/07/2017 23:07:40 0 2
>>> 468730 2017-07-12 12/07/2017 23:12:41 0 3
>>> 468730 2017-07-12 12/07/2017 23:22:06 0 4
>>> 468730 2017-07-12 12/07/2017 23:38:35 0 5
>>> 468730 2017-07-12 12/07/2017 23:44:19 0 6
>>> 468730 2017-07-12 12/07/2017 23:47:49 1 1
>>> 468730 2017-07-12 12/07/2017 23:48:49 1 2
>>> 468730 2017-07-12 12/07/2017 23:53:31 0 1
>>> 468730 2017-07-12 12/07/2017 23:57:01 1 1
>>> 468730 2017-07-13 13/07/2017 00:03:10 1 1
>>> 468730 2017-07-13 13/07/2017 00:06:35 0 1
>>> 468730 2017-07-13 13/07/2017 00:07:29 1 1
>>> 468731 2017-07-13 12/07/2017 12:59:17 1 1
>>> 468731 2017-07-13 12/07/2017 13:02:14 0 1
>>> 468731 2017-07-13 12/07/2017 13:07:40 0 2
>>> 468731 2017-07-13 12/07/2017 13:12:41 0 3
>>> *FYI :* We are one Hive 2.3.1.
>>>
>>
>

Re: Hive Row number Use case

Posted by Anup Tiwari <an...@gmail.com>.
Can someone look into this and revert if possible?


Regards,
Anup Tiwari

On Sat, Jul 14, 2018 at 12:28 AM, Anup Tiwari <an...@gmail.com>
wrote:

> Hi All,
>
> Can someone look into this and revert if possible?
>
> Thanks.
>
>
> On Thu, 12 Jul 2018 12:56 Anup Tiwari, <an...@gmail.com> wrote:
>
>> Hi All,
>>
>> We have a use case where we want to assign a row number to a table based
>> on 3 column ( uid, update_date, flag) i.e. if value of any of the column
>> gets changed, we want to reset this number. Please find below sample input
>> data and expected output data.
>>
>> Also please note that we have tried row_number() over(partition by uid,
>> update_date, flag order by update_time asc)  but due to this actual
>> input ordering got break due to I believe partition by clause because it
>> seems partition by creates group within column specified and then it start
>> row number and due to this actual ordering is breaking. So i just wanted to
>> know that is there any function available in hive which can give us below
>> result OR we are missing something in window function?
>>
>>
>> *Input Data :- *
>>
>> *uid* *update_date* *update_time* *flag*
>> 468730 2017-07-12 12/07/2017 22:59:17 1
>> 468730 2017-07-12 12/07/2017 23:02:14 0
>> 468730 2017-07-12 12/07/2017 23:07:40 0
>> 468730 2017-07-12 12/07/2017 23:12:41 0
>> 468730 2017-07-12 12/07/2017 23:22:06 0
>> 468730 2017-07-12 12/07/2017 23:38:35 0
>> 468730 2017-07-12 12/07/2017 23:44:19 0
>> 468730 2017-07-12 12/07/2017 23:47:49 1
>> 468730 2017-07-12 12/07/2017 23:48:49 1
>> 468730 2017-07-12 12/07/2017 23:53:31 0
>> 468730 2017-07-12 12/07/2017 23:57:01 1
>> 468730 2017-07-13 13/07/2017 00:03:10 1
>> 468730 2017-07-13 13/07/2017 00:06:35 0
>> 468730 2017-07-13 13/07/2017 00:07:29 1
>> 468731 2017-07-13 12/07/2017 12:59:17 1
>> 468731 2017-07-13 12/07/2017 13:02:14 0
>> 468731 2017-07-13 12/07/2017 13:07:40 0
>> 468731 2017-07-13 12/07/2017 13:12:41 0
>>
>>
>> *Output Data :-*
>>
>> *uid* *update_date* *update_time* *flag* *required_row_num*
>> 468730 2017-07-12 12/07/2017 22:59:17 1 1
>> 468730 2017-07-12 12/07/2017 23:02:14 0 1
>> 468730 2017-07-12 12/07/2017 23:07:40 0 2
>> 468730 2017-07-12 12/07/2017 23:12:41 0 3
>> 468730 2017-07-12 12/07/2017 23:22:06 0 4
>> 468730 2017-07-12 12/07/2017 23:38:35 0 5
>> 468730 2017-07-12 12/07/2017 23:44:19 0 6
>> 468730 2017-07-12 12/07/2017 23:47:49 1 1
>> 468730 2017-07-12 12/07/2017 23:48:49 1 2
>> 468730 2017-07-12 12/07/2017 23:53:31 0 1
>> 468730 2017-07-12 12/07/2017 23:57:01 1 1
>> 468730 2017-07-13 13/07/2017 00:03:10 1 1
>> 468730 2017-07-13 13/07/2017 00:06:35 0 1
>> 468730 2017-07-13 13/07/2017 00:07:29 1 1
>> 468731 2017-07-13 12/07/2017 12:59:17 1 1
>> 468731 2017-07-13 12/07/2017 13:02:14 0 1
>> 468731 2017-07-13 12/07/2017 13:07:40 0 2
>> 468731 2017-07-13 12/07/2017 13:12:41 0 3
>> *FYI :* We are one Hive 2.3.1.
>>
>

Re: Hive Row number Use case

Posted by Anup Tiwari <an...@gmail.com>.
Hi All,

Can someone look into this and revert if possible?

Thanks.


On Thu, 12 Jul 2018 12:56 Anup Tiwari, <an...@gmail.com> wrote:

> Hi All,
>
> We have a use case where we want to assign a row number to a table based
> on 3 column ( uid, update_date, flag) i.e. if value of any of the column
> gets changed, we want to reset this number. Please find below sample input
> data and expected output data.
>
> Also please note that we have tried row_number() over(partition by uid,
> update_date, flag order by update_time asc)  but due to this actual input
> ordering got break due to I believe partition by clause because it seems
> partition by creates group within column specified and then it start row
> number and due to this actual ordering is breaking. So i just wanted to
> know that is there any function available in hive which can give us below
> result OR we are missing something in window function?
>
>
> *Input Data :- *
>
> *uid* *update_date* *update_time* *flag*
> 468730 2017-07-12 12/07/2017 22:59:17 1
> 468730 2017-07-12 12/07/2017 23:02:14 0
> 468730 2017-07-12 12/07/2017 23:07:40 0
> 468730 2017-07-12 12/07/2017 23:12:41 0
> 468730 2017-07-12 12/07/2017 23:22:06 0
> 468730 2017-07-12 12/07/2017 23:38:35 0
> 468730 2017-07-12 12/07/2017 23:44:19 0
> 468730 2017-07-12 12/07/2017 23:47:49 1
> 468730 2017-07-12 12/07/2017 23:48:49 1
> 468730 2017-07-12 12/07/2017 23:53:31 0
> 468730 2017-07-12 12/07/2017 23:57:01 1
> 468730 2017-07-13 13/07/2017 00:03:10 1
> 468730 2017-07-13 13/07/2017 00:06:35 0
> 468730 2017-07-13 13/07/2017 00:07:29 1
> 468731 2017-07-13 12/07/2017 12:59:17 1
> 468731 2017-07-13 12/07/2017 13:02:14 0
> 468731 2017-07-13 12/07/2017 13:07:40 0
> 468731 2017-07-13 12/07/2017 13:12:41 0
>
>
> *Output Data :-*
>
> *uid* *update_date* *update_time* *flag* *required_row_num*
> 468730 2017-07-12 12/07/2017 22:59:17 1 1
> 468730 2017-07-12 12/07/2017 23:02:14 0 1
> 468730 2017-07-12 12/07/2017 23:07:40 0 2
> 468730 2017-07-12 12/07/2017 23:12:41 0 3
> 468730 2017-07-12 12/07/2017 23:22:06 0 4
> 468730 2017-07-12 12/07/2017 23:38:35 0 5
> 468730 2017-07-12 12/07/2017 23:44:19 0 6
> 468730 2017-07-12 12/07/2017 23:47:49 1 1
> 468730 2017-07-12 12/07/2017 23:48:49 1 2
> 468730 2017-07-12 12/07/2017 23:53:31 0 1
> 468730 2017-07-12 12/07/2017 23:57:01 1 1
> 468730 2017-07-13 13/07/2017 00:03:10 1 1
> 468730 2017-07-13 13/07/2017 00:06:35 0 1
> 468730 2017-07-13 13/07/2017 00:07:29 1 1
> 468731 2017-07-13 12/07/2017 12:59:17 1 1
> 468731 2017-07-13 12/07/2017 13:02:14 0 1
> 468731 2017-07-13 12/07/2017 13:07:40 0 2
> 468731 2017-07-13 12/07/2017 13:12:41 0 3
> *FYI :* We are one Hive 2.3.1.
>