Posted to user@hive.apache.org by Wil - <wi...@yahoo.com> on 2011/02/05 03:50:02 UTC
Robot identification by views per session in web logs
Hi all,
I've heard very nice things about using Hadoop/Hive/Pig for web log processing
and started playing around with it. However, I quickly ran into some roadblocks.
I want to identify and throw away robots in our web logs. One criterion for
robot identification is a user viewing more than X pages in a session.
Basically, I want to be able to determine how many page views a user had in a
session during processing time. Figuring out the session view count is not very
difficult. One of the challenges is figuring out how to update the previously
processed data (which could be from X days ago -- data already written to disk)
so that the rows from the start of the session also carry the final session
view count.
E.g., a robot that continually views pages:
session-id, date, view_count
session-id-1, 2011-01-01, view #1
session-id-1, 2011-01-01, view #2 ...
session-id-1, 2011-01-02, view #100..
session-id-1, 2011-01-03, view #500 <end>
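Computing the per-session total itself is a plain aggregation. A minimal sketch,
assuming the table and column names from the example above (`view_table`,
`session_id`):

```sql
-- Hypothetical sketch: total view count per session,
-- using the table/column names from the example above.
SELECT session_id, COUNT(*) AS session_view_count
FROM view_table
GROUP BY session_id;
```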
I want to ignore all the views from this robot, but cannot find a way to mark the
entire user/session as a robot. Using a relational database, I could easily say
"update view_table set view_count = 500 where session_id = session-id-1" and
that would be it. But using Hadoop/Hive this proves tricky.
What I would want would be:
session-id, date, session_view_count
session-id-1, 2011-01-01, 500
session-id-1, 2011-01-01, 500
session-id-1, 2011-01-02, 500
session-id-1, 2011-01-03, 500
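Since files on HDFS/S3 are effectively immutable, one pattern might be to rewrite
the output with the session total attached, rather than update it in place. A
sketch, assuming a target table named `enriched_views` (a hypothetical name) with
the layout above, and a `view_date` column standing in for the date field:

```sql
-- Hypothetical sketch: rewrite the data with the session total attached,
-- instead of a relational-style in-place UPDATE.
-- enriched_views and view_date are assumed names, not from the original post.
INSERT OVERWRITE TABLE enriched_views
SELECT v.session_id, v.view_date, s.session_view_count
FROM view_table v
JOIN (
  SELECT session_id, COUNT(*) AS session_view_count
  FROM view_table
  GROUP BY session_id
) s ON (v.session_id = s.session_id);
```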
If I had this data, I could filter out these robots with a Hive query:
"select * from view_table where session_view_count < 500". I could also look at
only 2011-01-01's data and get filtered data using the same query. With the
initial approach, I cannot simply say "select * from view_table where
view_count < 500" to get rid of all the views from the session. (Page views #1
to #499 would still be counted.)
Currently, I am using Amazon EMR with S3 as the storage using text files.
Has anyone run into these types of issues? Are there alternatives? Is there
any way to filter out these robots, whether during processing or at query time?
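For filtering at query time without rewriting any data, one option might be a
LEFT SEMI JOIN against the set of sessions under the threshold. A sketch,
assuming the same `view_table`/`session_id` names as above (the nested subquery
avoids relying on HAVING support):

```sql
-- Hypothetical sketch: keep only rows from sessions with fewer than 500 views,
-- filtering at query time with a LEFT SEMI JOIN (no data rewrite needed).
SELECT v.*
FROM view_table v
LEFT SEMI JOIN (
  SELECT session_id
  FROM (
    SELECT session_id, COUNT(*) AS cnt
    FROM view_table
    GROUP BY session_id
  ) t
  WHERE t.cnt < 500
) ok ON (v.session_id = ok.session_id);
```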
Thanks for any help or pointers!
--wil