Posted to dev@mahout.apache.org by Tim Bass <ti...@gmail.com> on 2009/05/28 11:55:28 UTC

Re: Mahout Recommendation for MySQL Data (Network Trending)

Dear Ted and Jeff,

My apologies for not following up on this earlier.  I've been busy on a
few other projects, including some work in Amazon CloudFront/S3 and then,
more interestingly, work in virtual economics and virtual currency for an
active web community.

I just wanted to let you know that I have not forgotten about this
discussion, and although I am currently busy on web development projects,
I do read the Mahout lists :-)

Again, my apologies for dropping the ball on my end.

Yours, Tim

On Thu, Apr 2, 2009 at 12:27 PM, Tim Bass  wrote:
> Hi Ted,
>
> I am on travel for a few days, starting in a few hours; I'll respond
> in more detail when I am back.
>
> After reading your generous reply, it seems I need to outline the
> overall data and problem set, a rough behavioral model, and other
> higher-level abstractions to help ensure the details fit the problem
> space.  I'll take a stab at this in my next post.
>
> More later....
>
> Yours sincerely, Tim
>
> On Thu, Apr 2, 2009 at 5:17 AM, Ted Dunning <te...@gmail.com> wrote:
>> On Wed, Apr 1, 2009 at 1:08 PM, Tim Bass <ti...@gmail.com> wrote:
>>
>>> I like the idea of clustering with mixture models and think that,
>>> with a bit of effort, it would not be too difficult to create initial
>>> first-order behavioral models.
>>> ...
>>> what might be the next steps?
>>>
>>
>> I think the first steps are:
>>
>> a) define and collect some sample data.  This should include real data and
>> some synthetic data for testing.
>>
>> b) define the form of behavioral models for clustering
>>
>> c) do an initial clustering
>>
>> d) diagnose what didn't work as planned
>>
>> You should say a bit about the data you have, but my guess is that you have
>> times, target host, general transaction type, source host and possibly user
>> name.  For a first step, I would use times, target host, transaction type
>> (or protocol or port) and source host.  I would obfuscate the source host by
>> picking a random salt and hashing the source IP or host name (and then
>> forget the key).  For synthetic data, I would generate several data files:
>>
>> 1) a mixture of 2, 10, and 100 Poisson sources with rates selected over a
>> fairly wide range.  Each source should be identified by a single source host
>> and go to a single target host.
>>
>> 2) a mixture of 100 Poisson sources and 10 periodic sources with rates over
>> a wide range.
>>
>> 3) something more appropriate for the data that you have (and I don't know
>> about)
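The salted-hash obfuscation and data set (1) above can be sketched in Python. This is a rough illustration only; the host names, rate values, and 1000-second time horizon are arbitrary choices, not anything prescribed by Mahout or the thread:

```python
import hashlib
import random

def obfuscate(host, salt):
    """Salted hash of a source host; once the salt is forgotten,
    the id cannot be mapped back to the original host."""
    return hashlib.sha256((salt + host).encode()).hexdigest()[:12]

def poisson_events(rate, t_end, rng):
    """Event times of a homogeneous Poisson process on (0, t_end],
    built by accumulating exponential inter-arrival gaps."""
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > t_end:
            return events
        events.append(t)

rng = random.Random(42)
salt = "%032x" % rng.getrandbits(128)  # pick a random salt; discard it afterwards

# data set (1): a mixture of Poisson sources with rates over a wide range,
# each identified by one (obfuscated) source host and one target host
records = []
for i, rate in enumerate([0.01, 0.1, 1.0, 10.0]):
    src = obfuscate("host-%d.example.com" % i, salt)
    for t in poisson_events(rate, 1000.0, rng):
        records.append((t, src, "target-%d" % i))
records.sort()
```

Scaling the rate list to 10 or 100 sources gives the other variants of data set (1).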
>>
>> The simplest behavioral models would simply be Poisson sources with a known
>> access rate.  That would not, however, handle periodic or nearly periodic
>> traffic.  My recommendation would be to start with a data source with times
>> distributed according to a gamma distribution for the real data.  Initially,
>> I would not look at source and target hosts in the models.
>>
>> The initial clustering should be run with differing amounts of the synthetic
>> data from (1) and Poisson models.  For small amounts of data, the clustering
>> should identify the high frequency components pretty easily.  With more
>> data, the lower frequency sources should become apparent.
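As a minimal stand-in for that initial clustering, a small EM loop over a k-component mixture of exponential inter-arrival distributions (i.e., mixed Poisson sources) could be prototyped like this; the initialization, underflow guards, and iteration count here are ad hoc assumptions, not Mahout code:

```python
import math
import random

def em_exponential_mixture(gaps, k, iters, rng):
    """EM for a k-component mixture of exponential distributions over
    inter-arrival gaps -- a toy stand-in for clustering Poisson sources."""
    rates = [1.0 / rng.choice(gaps) for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: responsibility of each component for each gap
        resp = []
        for g in gaps:
            ps = [w * r * math.exp(-r * g) for w, r in zip(weights, rates)]
            s = sum(ps)
            # underflow guard: fall back to uniform responsibilities
            resp.append([p / s for p in ps] if s > 0.0 else [1.0 / k] * k)
        # M step: update each component's mixing weight and rate
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(gaps)
            rates[j] = nj / max(sum(r[j] * g for r, g in zip(resp, gaps)), 1e-12)
    return weights, rates

rng = random.Random(1)
# gaps drawn from two Poisson sources: a fast one (rate 5) and a slow one (rate 0.2)
gaps = [rng.expovariate(5.0) for _ in range(2000)]
gaps += [rng.expovariate(0.2) for _ in range(2000)]
rng.shuffle(gaps)
weights, rates = em_exponential_mixture(gaps, 2, 50, rng)
```

With well-separated rates and enough data the high-frequency component tends to be found first, matching the expectation above; low-rate components need more data to stand out.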
>>
>> With data (1) and gamma models, the system should determine that the sources
>> are largely Poisson.
>>
>> With data (2) and Poisson models, it should be possible to show that the
>> periodic sources are not well modeled, largely because the periodic data
>> won't be attached to a single model.  With data (2) and gamma models, the
>> Poisson sources should have appropriate shape parameters and the periodic
>> sources should have a narrow range of predicted times between transactions.
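The shape-parameter distinction can be checked numerically: for gamma-distributed gaps, shape = mean^2 / variance, which comes out near 1 for a Poisson source (exponential gaps) and large for a nearly periodic one. A rough method-of-moments sketch, with arbitrary example rates:

```python
import random

def mom_gamma_shape(gaps):
    """Method-of-moments gamma shape estimate for inter-arrival gaps:
    shape = mean^2 / variance (1 => exponential/Poisson; large => near-periodic)."""
    n = len(gaps)
    mean = sum(gaps) / n
    var = sum((g - mean) ** 2 for g in gaps) / n
    return mean * mean / var

rng = random.Random(7)
poisson_gaps = [rng.expovariate(2.0) for _ in range(5000)]        # Poisson source
periodic_gaps = [abs(rng.gauss(0.5, 0.02)) for _ in range(5000)]  # near-periodic source
poisson_shape = mom_gamma_shape(poisson_gaps)    # should come out near 1
periodic_shape = mom_gamma_shape(periodic_gaps)  # should come out large
```

A full fit would estimate shape and rate per mixture component, but this moment check already separates the two source types.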
>>
>> I don't know how much data will be required for this unlabeled source
>> separation task and it may turn out to be difficult to see the low rate
>> signals against the high rate background.  Choice of priors will be
>> important to avoid describing every data set as being singleton observations
>> from a large number of very low rate sources.  There may also be convergence
>> issues.  For example, if you have a Poisson source with rate 1 and a
>> periodic source with rate 0.1, the mixture of a Poisson and a very narrow
>> gamma distribution would fit the data very well, but it would be very hard
>> to notice by accident that every 10 or so events occur at a constant rate.
>> With labels, this will be much easier, of course.
>>
>> Depending on the results with synthetic data, it may be time to look at the
>> real data or we may need to move to more interesting models.
>>
>> The next interesting model that I would be curious about would be one which
>> combines rate and target host.  This would likely be done using a gamma for
>> rate and a multinomial for target host.  One or more sources should be
>> shared by individual source hosts to emulate proxying and multi-tasking, so
>> the source host would provide additional information.  This could be used
>> to specify a
>> multi-level generative model where each source host is modeled by multiple
>> traffic sources and the number of traffic sources for each source host and
>> the individual rate and shape parameters would be the latent variables for
>> the model.
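A sampling sketch of such a multi-level generative model; the specific priors here (up to 3 sources per host, rates spanning 0.01 to 10, shape 1 vs. 100) are placeholder assumptions for illustration:

```python
import random

def sample_source_host(rng, targets):
    """One source host in the generative model: a latent number of traffic
    sources, each with gamma timing parameters (rate, shape) and a
    multinomial over target hosts."""
    n_sources = 1 + rng.randrange(3)        # latent: traffic sources per host
    sources = []
    for _ in range(n_sources):
        rate = 10.0 ** rng.uniform(-2, 1)   # latent: rate over a wide range
        shape = rng.choice([1.0, 100.0])    # Poisson-like vs. near-periodic
        w = [rng.random() for _ in targets]
        total = sum(w)
        sources.append({
            "rate": rate,
            "shape": shape,
            "target_probs": [x / total for x in w],  # multinomial over targets
        })
    return sources

def sample_gaps(source, n, rng):
    """Inter-arrival gaps for one traffic source: gamma(shape, scale) with
    mean gap 1/rate, so shape=1 is Poisson and large shape is near-periodic."""
    scale = 1.0 / (source["rate"] * source["shape"])
    return [rng.gammavariate(source["shape"], scale) for _ in range(n)]

rng = random.Random(3)
host = sample_source_host(rng, ["target-a", "target-b", "target-c"])
gaps = sample_gaps(host[0], 100, rng)
```

Inference would then recover the per-host source count and the per-source rate/shape/target parameters as latent variables from observed traffic.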
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>