Posted to user@mahout.apache.org by Radek Maciaszek <ra...@maciaszek.co.uk> on 2011/07/02 12:47:14 UTC

Re: Similarity between users' groups

Hello,

This project was put on hold for a while, so I only had time to look into
it recently. I have been thinking about down-sampling and different
sampling strategies.

What would be the minimum rate for sampling the users? Right now I sample 1
in 256 users. But if a group has only 400 users, I will not get as good an
estimate as I would for a group with 10k users. I am trying to work out
the right strategy for downsampling.

I was hoping there would be some statistical way of estimating the
sampling ratio.

Cheers,
Radek

On 18 February 2011 18:04, Sebastian Schelter <ss...@apache.org> wrote:

> This shouldn't be too difficult and would maybe make a good newcomer or
> student project.
>
> --sebastian
>
> Am 18.02.2011 18:19, schrieb Ted Dunning:
> > A better way to sample is to find groups with a very large number of users
> > and downsample the number of users to a maximum of about 1000 (or even 200
> > if you want to be more aggressive).  Do the same with users.
> >
> > That won't delete a whole lot of data volume, but it will make most
> > recommendation algorithms go much faster.  The idea is that after you have
> > 200 or more users in a group, you aren't learning anything new anyway.
> >
> > On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
> > <ra...@gmail.com> wrote:
> >
> >> Each user can belong to many groups, so the number of combinations is
> >> rather big. In fact, this number of combinations is so large that I am
> >> considering sampling the users and only analysing 1 in about 256 users.
> >> So essentially I would have about 1000+ groups and about 150k users.
> >> Since one user can potentially belong to many dozens of groups, this will
> >> easily go into millions of records anyway, but perhaps will be lower than
> >> the 100M margin you mentioned.
> >>
> >
>
>

Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
It is pretty easy to set up a reservoir sampler as a combiner and as the front end to a reducer.
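
As a rough illustration of the combiner idea, here is a minimal Python sketch, not Mahout or Hadoop code. It assumes the random-key ("bottom-k") variant of reservoir sampling rather than the classic replace-in-place form: every item gets a random key and only the k smallest keys are kept, so partial samples can be merged in any order. The function names sample_partial and merge_partials are hypothetical.

```python
import heapq
import random


def sample_partial(items, k, rng):
    """Combiner-like step: tag each item with a random key, keep the k smallest keys."""
    tagged = ((rng.random(), item) for item in items)
    return heapq.nsmallest(k, tagged, key=lambda pair: pair[0])


def merge_partials(partials, k):
    """Reducer-like step: the k smallest keys across all partials are a uniform sample."""
    merged = (pair for part in partials for pair in part)
    return heapq.nsmallest(k, merged, key=lambda pair: pair[0])


# Two input splits of the same (hypothetical) group, sampled independently.
rng = random.Random(42)
split_a = sample_partial(range(0, 5000), 200, rng)
split_b = sample_partial(range(5000, 5300), 200, rng)

# Final sample of at most 200 users for the group.
final = [item for _, item in merge_partials([split_a, split_b], 200)]
```

Because merging only keeps the smallest keys, it gives the same result whether it runs once in the reducer or repeatedly in combiners, which is the property a combiner needs.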

On Jul 2, 2011, at 14:22, Lance Norskog <go...@gmail.com> wrote:

> How to do this in an efficient way? No idea.

Re: Similarity between users' groups

Posted by Lance Norskog <go...@gmail.com>.
Reservoir sampling lets you build good per-user sample sets. This issue
has code demonstrating the approach:

https://issues.apache.org/jira/browse/MAHOUT-676

How to do this in an efficient way? No idea.

On Sat, Jul 2, 2011 at 9:18 AM, Ted Dunning <te...@gmail.com> wrote:
> Don't sample at a constant rate.
>
> Either downsample user ratings so that no user has more than a reasonable
> number of ratings or downsample users so that no thing has more than a
> reasonable number of users rating it.
>
> I generally prefer the former, but either should be fine.

-- 
Lance Norskog
goksron@gmail.com

Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
Don't sample at a constant rate.

Either downsample user ratings so that no user has more than a reasonable
number of ratings or downsample users so that no thing has more than a
reasonable number of users rating it.

I generally prefer the former, but either should be fine.
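
To make the difference from constant-rate sampling concrete: at 1 in 256, a group of 400 users yields fewer than 2 sampled users on average (400 / 256 is about 1.6), while a group of 10k users yields about 39. A cap keeps small groups whole and trims only the large ones. The following is a minimal Python sketch using classic reservoir sampling (Algorithm R) per key. The function name cap_per_key, the cap of 1000, and the toy data are assumptions for illustration only.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, max_per_key, seed=0):
    """Keep at most max_per_key values per key, chosen uniformly at random.

    pairs is an iterable of (key, value). Use (user, item) to cap the number
    of ratings per user, or (group, user) to cap the number of users per group.
    """
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # key -> sampled values
    seen = defaultdict(int)         # key -> number of values seen so far
    for key, value in pairs:
        seen[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < max_per_key:
            # Reservoir not full yet: keep everything.
            bucket.append(value)
        else:
            # Algorithm R: the i-th value replaces a random slot with
            # probability max_per_key / i.
            j = rng.randrange(seen[key])
            if j < max_per_key:
                bucket[j] = value
    return dict(reservoirs)


# Toy data: one small group and one large group.
pairs = [("small", u) for u in range(400)] + [("big", u) for u in range(10_000)]
capped = cap_per_key(pairs, max_per_key=1000)
print({group: len(users) for group, users in capped.items()})
# prints {'small': 400, 'big': 1000}
```

The small group is untouched and the large one is cut to the cap, so the effective sampling rate adapts to group size instead of being fixed in advance.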
