Posted to user@mahout.apache.org by Ian Holsman <li...@holsman.net> on 2008/10/12 09:47:38 UTC

collecting data (was Re: What are the business cases for collaborative filtering?)

Otis Gospodnetic wrote:
> Heh, it sounds like we are going through similar steps.  I first wrote a simple "beacon servlet" for tracking purposes.  Then opted for a simpler (and more static) pixel tracker and a web server (nginx) logging and a log parser that is supposed to process that log and store it to _____ (not sure where, yet, didn't get there) and then from there get it to Taste.  This, of course, means more batch oriented processes.  Going with the beacon servlet approach could *presumably* do something closer to real-time recommendations....
>
>   
Right. We have put our 'real time' portion on the sidelines for the 
moment, and have Hadoop jobs running every X minutes to process the 
data coming in.

We are planning on using something like Spread or possibly Jabber to 
handle pushing the data between the log collectors and the various 
receivers of the data.
Our scale also limits us: we have a lot of page views to count ;-)
> Ian, can you elaborate on the "feed data into HDFS" part?  You simply store it in HDFS?  Why HDFS?  Why not some other FS or why not a RDBMS?  What happens to your data after you store it in the HDFS?
>
>   

We put the log files onto HDFS so that other things can read them and 
process them.
We have several CF applications that use subsets of the data, ranging 
from a very basic one that shows summaries of popular pages on a site 
to ones that use the Fuzzy K-Means algorithm that Pallavi has contributed.
Several of those scripts write summary info into a set of MySQL 
servers that are accessed by various web sites, as our web site 
developers are familiar with MySQL.
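
To make the "summaries of popular pages" case concrete, here is a minimal sketch of the counting step only. It is not the actual job described above: the log format (combined-style access log lines), the regular expression, and the function name top_pages are all assumptions for illustration, and a real job at this scale would run as a Hadoop job over files in HDFS rather than in a single process.

```python
import re
from collections import Counter

# Assumed format: a combined-style access log line whose request field looks
# like "GET /some/page HTTP/1.1". Adjust the pattern to the real log format.
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def top_pages(lines, n=10):
    """Return the n most requested paths as (path, count) pairs."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # skip lines that do not look like requests
        # Drop the query string so /a?x=1 and /a count as the same page.
        path = match.group(1).split("?", 1)[0]
        counts[path] += 1
    return counts.most_common(n)
```

The resulting (path, count) pairs are the kind of summary rows that could then be written to a database table for the web sites to read.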

Regards
Ian
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>

Re: collecting data (was Re: What are the business cases for collaborative filtering?)

Posted by Sean Owen <sr...@gmail.com>.

On Sun, Oct 12, 2008 at 8:47 AM, Ian Holsman <li...@holsman.net> wrote:
> right.. we have put our 'real time' portion on the side lines for the
> moment, and have hadoop jobs running every X minutes to process the data
> coming in.

Incidentally this sort of model is certainly what I recommend. I don't
think real-time updates to recommenders are a good use of resources,
let alone feasible in many cases.

> we put the log files onto HDFS so that other things can read them and
> process them.

PS: if you have suggestions for improvements to the code here -- like
the ability to read from N files instead of 1, or from N tables or
something -- do let me know so I can get on it.
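
As a rough illustration of what "read from N files instead of 1" could mean, here is a small sketch that merges several preference files into one in-memory structure. It is not Taste's API: the function name read_preferences, the assumed line format (userID,itemID,preference, comma-separated), and the rule that later files override earlier ones are all assumptions made for the example.

```python
import csv
from collections import defaultdict


def read_preferences(paths):
    """Merge preference files into {user: {item: preference}}.

    Assumes each line is "userID,itemID,preference". When the same
    (user, item) pair appears more than once, the last value read wins.
    """
    prefs = defaultdict(dict)
    for path in paths:
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                if len(row) < 3:
                    continue  # skip blank or incomplete lines
                user, item, value = row[0], row[1], row[2]
                prefs[user][item] = float(value)
    return dict(prefs)
```

For example, read_preferences(["day1.csv", "day2.csv"]) would combine two daily dumps (hypothetical file names), with day2.csv taking precedence where they overlap.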