You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by qiaoresearcher <qi...@gmail.com> on 2012/10/31 22:18:18 UTC

need help on writing hive query

Hi all,

here is the question. Assume we have a table like:
------------------------------------------------------------------------------------------------------------------------------
user_id    ||  user_visiting_time    ||      user_current_web_page     ||
 user_previous_web_page
user 1                 time (1,1)                                   page 1
                                      page 0
user 1                 time (1,2)                                   page 2
                                      page 1
user 1                 time (1,3 )                                  page 3
                                      page 2
.....                          ......
     ....                                                ....
user n                 time (n,1)                                   page 1
                                      page 0
user n                 time (n,2)                                   page 2
                                      page 1
user n                 time (n,3)                                   page 3
                                      page 2

that is, in each row, we know the current web page that user is viewing,
and we know the previous web page the user coming from

now we want to generate a list for each user that recorded the complete
path the user is taking:
i.e., how can we use hive to generate output like:
------------------------------------------------------------------------------------------------
user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
beginning page of user 1)
user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
the beginning page of user 2)
the web pages viewed by user 1 and user 2 might be different.

can we generate this using hive?

thanks,

Re: need help on writing hive query

Posted by Mark Grover <gr...@gmail.com>.
You should look into Hive's cluster by/distribute by functionality.

https://cwiki.apache.org/Hive/languagemanual-sortby.html#LanguageManualSortBy-SyntaxofClusterByandDistributeBy
https://cwiki.apache.org/Hive/languagemanual-transform.html

On Wed, Oct 31, 2012 at 2:18 PM, qiaoresearcher <qi...@gmail.com>wrote:

> Hi all,
>
> here is the question. Assume we have a table like:
>
> ------------------------------------------------------------------------------------------------------------------------------
> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
>  user_previous_web_page
> user 1                 time (1,1)                                   page 1
>                                       page 0
> user 1                 time (1,2)                                   page 2
>                                       page 1
> user 1                 time (1,3 )                                  page 3
>                                       page 2
> .....                          ......
>        ....                                                ....
> user n                 time (n,1)                                   page 1
>                                       page 0
> user n                 time (n,2)                                   page 2
>                                       page 1
> user n                 time (n,3)                                   page 3
>                                       page 2
>
> that is, in each row, we know the current web page that user is viewing,
> and we know the previous web page the user coming from
>
> now we want to generate a list for each user that recorded the complete
> path the user is taking:
> i.e., how can we use hive to generate output like:
>
> ------------------------------------------------------------------------------------------------
> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
> beginning page of user 1)
> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
> the beginning page of user 2)
> the web pages viewed by user 1 and user 2 might be different.
>
> can we generate this using hive?
>
> thanks,
>

RE: need help on writing hive query

Posted by java8964 java8964 <ja...@hotmail.com>.
If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, as Phil points out, an custom UDF of collect_list() is the way to go.
You need to implement collect_list() UDF by yourself, hive doesn't have one by default.But it should be straight forward. In fact, you can reuse the code of collect_set(), replace the internal set with a Java ArrayList, then
select user_id, collect_list(user_current_web_page)from(    select user_id, user_current_web_page    order by user_id asc, user_visiting_time asc)agroup by user_id
Yong
> Subject: Re: need help on writing hive query
> From: matthewtckr@gmail.com
> Date: Wed, 31 Oct 2012 17:53:06 -0400
> To: user@hive.apache.org
> 
> I did a similar query a few months ago.  In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array().  There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order.
> 
> If there's an easier way, please let me know.
> 
> Matt Tucker
> 
> On Oct 31, 2012, at 5:37 PM, Tom Brown <to...@gmail.com> wrote:
> 
> > It wouldn't retrieve the user's path in a single string, but you could
> > simply select the user id and current page, ordered by the timestamp.
> > 
> > It would require a second step to turn it into the single string path,
> > so that might be a deal-breaker.
> > 
> > --Tom
> > 
> > On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
> > <ph...@gmail.com> wrote:
> >> You could use collect_set() and GROUP BY. That wouldn't preserve order
> >> though.
> >> 
> >> Phil.
> >> 
> >> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <qi...@gmail.com> wrote:
> >>> 
> >>> Hi all,
> >>> 
> >>> here is the question. Assume we have a table like:
> >>> 
> >>> ------------------------------------------------------------------------------------------------------------------------------
> >>> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
> >>> user_previous_web_page
> >>> user 1                 time (1,1)                                   page 1
> >>> page 0
> >>> user 1                 time (1,2)                                   page 2
> >>> page 1
> >>> user 1                 time (1,3 )                                  page 3
> >>> page 2
> >>> .....                          ......
> >>> ....                                                ....
> >>> user n                 time (n,1)                                   page 1
> >>> page 0
> >>> user n                 time (n,2)                                   page 2
> >>> page 1
> >>> user n                 time (n,3)                                   page 3
> >>> page 2
> >>> 
> >>> that is, in each row, we know the current web page that user is viewing,
> >>> and we know the previous web page the user coming from
> >>> 
> >>> now we want to generate a list for each user that recorded the complete
> >>> path the user is taking:
> >>> i.e., how can we use hive to generate output like:
> >>> 
> >>> ------------------------------------------------------------------------------------------------
> >>> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
> >>> beginning page of user 1)
> >>> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
> >>> the beginning page of user 2)
> >>> the web pages viewed by user 1 and user 2 might be different.
> >>> 
> >>> can we generate this using hive?
> >>> 
> >>> thanks,
 		 	   		  

Re: need help on writing hive query

Posted by Matt Tucker <ma...@gmail.com>.
I did a similar query a few months ago.  In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array().  There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order.

If there's an easier way, please let me know.

Matt Tucker

On Oct 31, 2012, at 5:37 PM, Tom Brown <to...@gmail.com> wrote:

> It wouldn't retrieve the user's path in a single string, but you could
> simply select the user id and current page, ordered by the timestamp.
> 
> It would require a second step to turn it into the single string path,
> so that might be a deal-breaker.
> 
> --Tom
> 
> On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
> <ph...@gmail.com> wrote:
>> You could use collect_set() and GROUP BY. That wouldn't preserve order
>> though.
>> 
>> Phil.
>> 
>> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <qi...@gmail.com> wrote:
>>> 
>>> Hi all,
>>> 
>>> here is the question. Assume we have a table like:
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------------
>>> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
>>> user_previous_web_page
>>> user 1                 time (1,1)                                   page 1
>>> page 0
>>> user 1                 time (1,2)                                   page 2
>>> page 1
>>> user 1                 time (1,3 )                                  page 3
>>> page 2
>>> .....                          ......
>>> ....                                                ....
>>> user n                 time (n,1)                                   page 1
>>> page 0
>>> user n                 time (n,2)                                   page 2
>>> page 1
>>> user n                 time (n,3)                                   page 3
>>> page 2
>>> 
>>> that is, in each row, we know the current web page that user is viewing,
>>> and we know the previous web page the user coming from
>>> 
>>> now we want to generate a list for each user that recorded the complete
>>> path the user is taking:
>>> i.e., how can we use hive to generate output like:
>>> 
>>> ------------------------------------------------------------------------------------------------
>>> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
>>> beginning page of user 1)
>>> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
>>> the beginning page of user 2)
>>> the web pages viewed by user 1 and user 2 might be different.
>>> 
>>> can we generate this using hive?
>>> 
>>> thanks,

Re: need help on writing hive query

Posted by Tom Brown <to...@gmail.com>.
It wouldn't retrieve the user's path in a single string, but you could
simply select the user id and current page, ordered by the timestamp.

It would require a second step to turn it into the single string path,
so that might be a deal-breaker.

--Tom

On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
<ph...@gmail.com> wrote:
> You could use collect_set() and GROUP BY. That wouldn't preserve order
> though.
>
> Phil.
>
> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <qi...@gmail.com> wrote:
>>
>> Hi all,
>>
>> here is the question. Assume we have a table like:
>>
>> ------------------------------------------------------------------------------------------------------------------------------
>> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
>> user_previous_web_page
>> user 1                 time (1,1)                                   page 1
>> page 0
>> user 1                 time (1,2)                                   page 2
>> page 1
>> user 1                 time (1,3 )                                  page 3
>> page 2
>> .....                          ......
>> ....                                                ....
>> user n                 time (n,1)                                   page 1
>> page 0
>> user n                 time (n,2)                                   page 2
>> page 1
>> user n                 time (n,3)                                   page 3
>> page 2
>>
>> that is, in each row, we know the current web page that user is viewing,
>> and we know the previous web page the user coming from
>>
>> now we want to generate a list for each user that recorded the complete
>> path the user is taking:
>> i.e., how can we use hive to generate output like:
>>
>> ------------------------------------------------------------------------------------------------
>> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
>> beginning page of user 1)
>> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
>> the beginning page of user 2)
>> the web pages viewed by user 1 and user 2 might be different.
>>
>> can we generate this using hive?
>>
>> thanks,

Re: need help on writing hive query

Posted by Philip Tromans <ph...@gmail.com>.
You could use collect_set() and GROUP BY. That wouldn't preserve order
though.

Phil.
On Oct 31, 2012 9:18 PM, "qiaoresearcher" <qi...@gmail.com> wrote:

> Hi all,
>
> here is the question. Assume we have a table like:
>
> ------------------------------------------------------------------------------------------------------------------------------
> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
>  user_previous_web_page
> user 1                 time (1,1)                                   page 1
>                                       page 0
> user 1                 time (1,2)                                   page 2
>                                       page 1
> user 1                 time (1,3 )                                  page 3
>                                       page 2
> .....                          ......
>        ....                                                ....
> user n                 time (n,1)                                   page 1
>                                       page 0
> user n                 time (n,2)                                   page 2
>                                       page 1
> user n                 time (n,3)                                   page 3
>                                       page 2
>
> that is, in each row, we know the current web page that user is viewing,
> and we know the previous web page the user coming from
>
> now we want to generate a list for each user that recorded the complete
> path the user is taking:
> i.e., how can we use hive to generate output like:
>
> ------------------------------------------------------------------------------------------------
> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
> beginning page of user 1)
> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
> the beginning page of user 2)
> the web pages viewed by user 1 and user 2 might be different.
>
> can we generate this using hive?
>
> thanks,
>