Posted to user@hive.apache.org by Wil - <wi...@yahoo.com> on 2011/02/10 23:24:22 UTC

Determining New/Repeat Visitor

Hi,

Is there a good way to determine repeat visitors when analyzing web logs using
Hive/Hadoop?  One idea I can come up with is to store the list of user ids
and session ids (session data) in another table and then join against that
table.  However, the session data table would grow indefinitely (potentially
to over 1B records).  Joining two large tables in Hive would result in a
Common Join, and I cannot find any performance information on it.  Is this
even feasible and scalable?
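
To make the idea concrete, here is a rough sketch of the logic in plain
Python rather than Hive. The function name, the in-memory set standing in
for the session data table, and the sample ids are all made up for
illustration. It also assumes a visitor is labelled once per batch of logs,
so several sessions in the same batch do not turn a new visitor into a
repeat one.

```python
from typing import Dict, Iterable, Set, Tuple


def classify_visitors(
    seen_users: Set[str],
    todays_hits: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, str], Set[str]]:
    """Label each user in the current batch as 'new' or 'repeat'.

    seen_users  -- user ids from all earlier batches (hypothetical stand-in
                   for the session data table)
    todays_hits -- (user_id, session_id) pairs from the current batch
    """
    # Collapse the batch to distinct users; session ids are not needed
    # to decide new vs. repeat.
    todays_users = {user_id for user_id, _session_id in todays_hits}

    # This membership test plays the role of the join against the
    # session data table.
    labels = {
        user_id: ("repeat" if user_id in seen_users else "new")
        for user_id in todays_users
    }

    # The lookup data grows by the users not seen before.
    updated_seen = seen_users | todays_users
    return labels, updated_seen


if __name__ == "__main__":
    seen = {"u1"}
    hits = [("u1", "s9"), ("u2", "s10"), ("u2", "s11")]
    labels, seen = classify_visitors(seen, hits)
    for user_id in sorted(labels):
        print(user_id, labels[user_id])
```

In this sketch the lookup side holds one entry per user rather than one per
session, so it grows with the number of distinct users. Whether the
equivalent join performs acceptably in Hive at 1B+ rows is exactly what I am
unsure about.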

There was an older thread that was somewhat related to this 
issue: http://osdir.com/ml/hive-user-hadoop-apache/2009-07/msg00267.html and one 
of the suggestions was to use HBase.  However, I don't see anything on using 
the Hive/HBase integration to update fields.

Are there any alternatives? Or a better approach to solve this problem?

Thanks for any pointers.

Thanks,
--wil