Posted to user@hive.apache.org by Wil - <wi...@yahoo.com> on 2011/02/05 03:50:02 UTC

Robot identification by views per session in web logs

Hi all,

I've heard very nice things about using Hadoop/Hive/Pig for web log processing
and started playing around with it. However, I quickly ran into some road
blocks. I want to identify and throw away robots in our web logs. One criterion
for robot identification is a user viewing more than X pages in a session.

Basically, I want to be able to determine how many page views a user had in a
session at processing time.  Figuring out the session view count is not very
difficult. One of the challenges is figuring out how to update previously
processed data (which could be from X days ago -- data already written to disk)
so that the earlier records of the session also have the session view count.

E.g., a robot that continually views pages:
session-id,   date,       view_count
session-id-1, 2011-01-01, view #1
session-id-1, 2011-01-01, view #2 ...
session-id-1, 2011-01-02, view #100..
session-id-1, 2011-01-03, view #500 <end>

I want to ignore all the views from this robot, but cannot find a way to mark
the entire user/session as a robot.  Using a relational database, I could easily
say "update view_table set view_count = 500 where session_id = 'session-id-1'"
and that would be it. But doing the same with Hadoop/Hive proves tricky.
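
The closest equivalent I've come up with is rewriting the whole table with the
counts joined back in, roughly like this (just a sketch --
view_table_with_counts is a made-up output table, and the column names are only
for illustration):

  insert overwrite table view_table_with_counts
  select v.session_id, v.view_date, c.session_view_count
  from view_table v
  join (
    select session_id, count(*) as session_view_count
    from view_table
    group by session_id
  ) c on (v.session_id = c.session_id);

That means re-reading and rewriting a lot of old data just to update one count,
which seems heavy-handed.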


What I would want would be:
session-id,   date,       session_view_count
session-id-1, 2011-01-01, 500
session-id-1, 2011-01-01, 500
session-id-1, 2011-01-02, 500
session-id-1, 2011-01-03, 500

If I had this data, I could filter out these robots with a single Hive query:
"select * from view_table where session_view_count < 500". I could also look at
only 2011-01-01's data and get filtered results using the same query.  With the
initial approach, I cannot simply say "select * from view_table where
view_count < 500" to get rid of all the views from the session. (Page views #1
through #499 would still be counted.)
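
The only query-time workaround I can see without storing session_view_count on
each row is to recompute it in a subquery every time, something like this (same
illustrative column names as above):

  select v.*
  from view_table v
  join (
    select session_id, count(*) as session_view_count
    from view_table
    group by session_id
  ) c on (v.session_id = c.session_id)
  where c.session_view_count < 500;

but that re-scans the full history on every query, which is exactly what I was
hoping to avoid by materializing the count on each row.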

Currently, I am using Amazon EMR with S3 as the storage layer, with plain text
files.

Has anyone run into these types of issues?  Are there alternatives? Is there
any way to filter out these robots, either during processing or at query time?

Thanks for any help or pointers! 

--wil