You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by java8964 java8964 <ja...@hotmail.com> on 2012/11/01 00:08:17 UTC

RE: need help on writing hive query

If you don't need to join current_web_page and previous_web_page, assuming you can just trust the time stamp, as Phil points out, an custom UDF of collect_list() is the way to go.
You need to implement collect_list() UDF by yourself, hive doesn't have one by default.But it should be straight forward. In fact, you can reuse the code of collect_set(), replace the internal set with a Java ArrayList, then
select user_id, collect_list(user_current_web_page)from(    select user_id, user_current_web_page    order by user_id asc, user_visiting_time asc)agroup by user_id
Yong
> Subject: Re: need help on writing hive query
> From: matthewtckr@gmail.com
> Date: Wed, 31 Oct 2012 17:53:06 -0400
> To: user@hive.apache.org
> 
> I did a similar query a few months ago.  In short, I left-padded the page name with the time stamp, grouped with collect_set, and then used sort_array().  There was some other cleanup work and converting back to string to remove the time stamps, but it remained in order.
> 
> If there's an easier way, please let me know.
> 
> Matt Tucker
> 
> On Oct 31, 2012, at 5:37 PM, Tom Brown <to...@gmail.com> wrote:
> 
> > It wouldn't retrieve the user's path in a single string, but you could
> > simply select the user id and current page, ordered by the timestamp.
> > 
> > It would require a second step to turn it into the single string path,
> > so that might be a deal-breaker.
> > 
> > --Tom
> > 
> > On Wed, Oct 31, 2012 at 3:32 PM, Philip Tromans
> > <ph...@gmail.com> wrote:
> >> You could use collect_set() and GROUP BY. That wouldn't preserve order
> >> though.
> >> 
> >> Phil.
> >> 
> >> On Oct 31, 2012 9:18 PM, "qiaoresearcher" <qi...@gmail.com> wrote:
> >>> 
> >>> Hi all,
> >>> 
> >>> here is the question. Assume we have a table like:
> >>> 
> >>> ------------------------------------------------------------------------------------------------------------------------------
> >>> user_id    ||  user_visiting_time    ||      user_current_web_page     ||
> >>> user_previous_web_page
> >>> user 1                 time (1,1)                                   page 1
> >>> page 0
> >>> user 1                 time (1,2)                                   page 2
> >>> page 1
> >>> user 1                 time (1,3 )                                  page 3
> >>> page 2
> >>> .....                          ......
> >>> ....                                                ....
> >>> user n                 time (n,1)                                   page 1
> >>> page 0
> >>> user n                 time (n,2)                                   page 2
> >>> page 1
> >>> user n                 time (n,3)                                   page 3
> >>> page 2
> >>> 
> >>> that is, in each row, we know the current web page that user is viewing,
> >>> and we know the previous web page the user coming from
> >>> 
> >>> now we want to generate a list for each user that recorded the complete
> >>> path the user is taking:
> >>> i.e., how can we use hive to generate output like:
> >>> 
> >>> ------------------------------------------------------------------------------------------------
> >>> user 1 :      page 1   page 2 page 3  page 4  .......... (till reach the
> >>> beginning page of user 1)
> >>> user 2:       page 1 page 2 page 3  page 4 page 5  .......  ( till reach
> >>> the beginning page of user 2)
> >>> the web pages viewed by user 1 and user 2 might be different.
> >>> 
> >>> can we generate this using hive?
> >>> 
> >>> thanks,