Posted to user@mahout.apache.org by Radek Maciaszek <ra...@gmail.com> on 2011/02/17 12:34:32 UTC

Similarity between users' groups

Hello,

I have the following problem, and I am trying to figure out whether Mahout is
a good fit for it or whether there is a much simpler approach.

Consider I have users who can belong to many groups:
user1: group1, group2
user2: group2
user3: group2, group3
... and millions more

I am trying to find similarities between the groups (not the users). A
simple similarity metric (e.g. from 0 to 1: close to 0 for not similar at all,
close to 1 for very similar) would be ideal. So essentially I need to calculate
such a metric for every pair of groups.

Is this something Mahout can help me with?

Many thanks,
Radek

Re: Similarity between users' groups

Posted by Sean Owen <sr...@gmail.com>.
(I don't think transposition is needed per se, just feed it in as-is
and compute item-item similarity. It would be fine, sure, to transpose
and compute user-user similarity, if you wanted. Rating is not
needed.)

On Thu, Feb 17, 2011 at 8:42 PM, Ted Dunning <te...@gmail.com> wrote:
> Yes.
>
> Simply transpose your data and then use standard similarity techniques.
>
> Transposition in this case means that you would reformulate your data to be
>
> group1: user ... user
>
> In practice, the standard input form for Mahout recommendations is more like
> this:
>
> user group rating
>
> where your ratings will always be 1.  Simply redesignating the first two
> columns suffices to transpose data like this.
>

Re: Similarity between users' groups

Posted by Sean Owen <sr...@gmail.com>.
You can do this, but unless you have more than about 100M user-group
memberships, it's overkill. The non-Hadoop solution is about 10 lines
of code by comparison.

On Fri, Feb 18, 2011 at 1:14 PM, Radek Maciaszek <ra...@maciaszek.co.uk> wrote:
> Hi Ted,
>
> Thanks for pointing me in the right direction. I just looked more
> closely at the recommendation wiki and I think I can do what you
> proposed. To quote from this page
> <https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering>:
> "*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes
> all similar items. It expects a .csv file with the preference data as input,
> where each line represents a single preference in the form
> *userID,itemID,value* and outputs pairs of itemIDs with their associated
> similarity value."
>
> If I pass the data in the format "userId,groupId,1" it should output pairs
> of groupIDs with their similarities - or at least I hope so. Sounds easy :)
>
> Many thanks!
> Radek

Re: Similarity between users' groups

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Radek,

Looking forward to your report. This job works fine on EMR.

--sebastian

On 18.02.2011 12:12, Radek Maciaszek wrote:
> Hi Sebastian,
>
> Thanks for the tip. I will try to report later how the analysis goes.
> Hopefully EMR will work fine with all this.
>
> Cheers,
> Radek


Re: Similarity between users' groups

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hi Sebastian,

Thanks for the tip. I will try to report later how the analysis goes.
Hopefully EMR will work fine with all this.

Cheers,
Radek

On 18 February 2011 10:20, Sebastian Schelter <ss...@apache.org> wrote:

> Hi Radek,
>
> While this is a nice and creative way to use ItemSimilarityJob, be aware that
> it might prune away some of your data! So either set the parameter
> "maxCooccurrencesPerItem" to a very high number or use RowSimilarityJob
> directly.
>
> --sebastian

Re: Similarity between users' groups

Posted by Sebastian Schelter <ss...@apache.org>.
Hi Radek,

While this is a nice and creative way to use ItemSimilarityJob, be aware
that it might prune away some of your data! So either set the
parameter "maxCooccurrencesPerItem" to a very high number or use
RowSimilarityJob directly.

--sebastian

On 18.02.2011 11:14, Radek Maciaszek wrote:
> Hi Ted,
>
> Thanks for pointing me in the right direction. I just looked more
> closely at the recommendation wiki and I think I can do what you
> proposed. To quote from this page
> <https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering>:
> "*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes
> all similar items. It expects a .csv file with the preference data as input,
> where each line represents a single preference in the form
> *userID,itemID,value* and outputs pairs of itemIDs with their associated
> similarity value."
>
> If I pass the data in the format "userId,groupId,1" it should output pairs
> of groupIDs with their similarities - or at least I hope so. Sounds easy :)
>
> Many thanks!
> Radek


Re: Similarity between users' groups

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hi Ted,

Thanks for pointing me in the right direction. I just looked more
closely at the recommendation wiki and I think I can do what you
proposed. To quote from this page
<https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering>:
"*org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob* computes
all similar items. It expects a .csv file with the preference data as input,
where each line represents a single preference in the form
*userID,itemID,value* and outputs pairs of itemIDs with their associated
similarity value."

If I pass the data in the format "userId,groupId,1" it should output pairs
of groupIDs with their similarities - or at least I hope so. Sounds easy :)

Many thanks!
Radek

On 17 February 2011 17:42, Ted Dunning <te...@gmail.com> wrote:

> Yes.
>
> Simply transpose your data and then use standard similarity techniques.
>
> Transposition in this case means that you would reformulate your data to be
>
> group1: user ... user
>
> In practice, the standard input form for Mahout recommendations is more
> like
> this:
>
> user group rating
>
> where your ratings will always be 1.  Simply redesignating the first two
> columns suffices to transpose data like this.

Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
Yes.

Simply transpose your data and then use standard similarity techniques.

Transposition in this case means that you would reformulate your data to be

group1: user ... user

In practice, the standard input form for Mahout recommendations is more like
this:

user group rating

where your ratings will always be 1.  Simply redesignating the first two
columns suffices to transpose data like this.

On Thu, Feb 17, 2011 at 3:34 AM, Radek Maciaszek
<ra...@gmail.com> wrote:

> I am trying to find similarities between the groups (not the users). A
> simple similarity metric (e.g. from 0 to 1: close to 0 for not similar at all,
> close to 1 for very similar) would be ideal. So essentially I need to calculate
> such a metric for every pair of groups.
>
> Is this something Mahout can help me with?
>

Re: Similarity between users' groups

Posted by Sean Owen <sr...@gmail.com>.
I think the in-memory solution will work at that scale. You may have
to increase the heap to 4GB or more (that is, you may have to find a
large-ish machine). But yes, you can probably get perfectly good results by
sampling even a fraction of the input, which definitely fits. That's
the place to start, I think.

On Fri, Feb 18, 2011 at 6:41 PM, Radek Maciaszek
<ra...@gmail.com> wrote:
> Hi Sean,
>
> Thanks for so many ideas, I will look into these. Unfortunately the amount
> of data we are dealing with is quite substantial. There are 1000+ groups
> and about 40 million users to analyse. Moreover, the business need is to
> eventually have an even bigger number of groups. Each user can belong to
> many groups, so the number of combinations is rather big. In fact this number
> of combinations is so large that I am considering sampling the users and only
> analysing 1 in about 256 users. So essentially I would have 1000+ groups
> and about 150k users. Since one user can potentially belong to many dozens
> of groups, this will easily go into millions of records anyway, but perhaps
> it will stay below the 100M mark you mentioned.
>
> Yesterday I wasn't sure if my existing cluster was big enough for this, and
> now I'm tempted to try to do this on one machine. Nice one.
>
> Cheers,
> Radek

Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
It is pretty easy to set up a reservoir sampler as a combiner and as the front end to a reducer.
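
For reference, a plain single-stream reservoir sampler (the classic "Algorithm R") can be sketched in a few lines of Python. This is only an illustration of the sampling step, not the Hadoop combiner/reducer wiring, and the function name reservoir_sample is made up here. Note that merging reservoirs produced by several combiners would additionally need to account for how many items each one saw, which this sketch does not attempt.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # evicting a uniformly chosen existing entry.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: keep at most 10 of 1000 items in one pass.
sample = reservoir_sample(range(1000), 10, random.Random(0))
print(len(sample))
```

The appeal for this use case is that it needs one pass and only k items of memory per key, so it caps large users or groups without touching small ones.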

Sent from my iPhone

On Jul 2, 2011, at 14:22, Lance Norskog <go...@gmail.com> wrote:

> How to do this in an efficient way? No idea.

Re: Similarity between users' groups

Posted by Lance Norskog <go...@gmail.com>.
"reservoir sampling" lets you make good per-user sample sets. This has
code demonstrating the approach.

https://issues.apache.org/jira/browse/MAHOUT-676

How to do this in an efficient way? No idea.

On Sat, Jul 2, 2011 at 9:18 AM, Ted Dunning <te...@gmail.com> wrote:
> Don't sample at a constant rate.
>
> Either downsample user ratings so that no user has more than a reasonable
> number of ratings or downsample users so that no thing has more than a
> reasonable number of users rating it.
>
> I generally prefer the former, but either should be fine.



-- 
Lance Norskog
goksron@gmail.com

Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
Don't sample at a constant rate.

Either downsample user ratings so that no user has more than a reasonable
number of ratings or downsample users so that no thing has more than a
reasonable number of users rating it.

I generally prefer the former, but either should be fine.

On Sat, Jul 2, 2011 at 3:47 AM, Radek Maciaszek <ra...@maciaszek.co.uk> wrote:

> Hello,
>
> This project was put on hold for a while, so I only had time to look into
> it recently. I have been thinking about down-sampling and different
> sampling strategies.
>
> What would be the minimum rate for sampling the users? Right now I sample 1
> in 256 users. But if there are only 400 users in a group, I will not get
> as good an estimate as if there were 10k users. I am trying to work out
> the right strategy for downsampling.
>
> I was hoping there would be some statistical way of estimating the sampling
> ratio?
>
> Cheers,
> Radek

Re: Similarity between users' groups

Posted by Radek Maciaszek <ra...@maciaszek.co.uk>.
Hello,

This project was put on hold for a while, so I only had time to look into
it recently. I have been thinking about down-sampling and different
sampling strategies.

What would be the minimum rate for sampling the users? Right now I sample 1
in 256 users. But if there are only 400 users in a group, I will not get
as good an estimate as if there were 10k users. I am trying to work out
the right strategy for downsampling.

I was hoping there would be some statistical way of estimating the sampling
ratio?
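
To put rough numbers on the concern, here is a small Python sketch of what a constant 1-in-256 rate leaves per group. It assumes each user is kept independently with the same probability (an assumption for illustration, not a description of the actual sampling code), so the sampled count per group is binomial. The helper names are made up.

```python
import math

RATE = 1 / 256  # constant sampling rate mentioned above


def expected_sampled(group_size, rate=RATE):
    """Expected number of a group's users that survive sampling."""
    return group_size * rate


def sampled_count_std(group_size, rate=RATE):
    """Standard deviation of the sampled count under independent sampling."""
    return math.sqrt(group_size * rate * (1 - rate))


for size in (400, 10_000):
    mean = expected_sampled(size)
    std = sampled_count_std(size)
    print(f"group of {size}: about {mean:.1f} sampled users (std {std:.1f})")
```

A group of 400 keeps fewer than two users on average, with a spread about as large as the mean, while a group of 10k keeps about 39, so small groups are the ones that suffer under a constant rate.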

Cheers,
Radek

On 18 February 2011 18:04, Sebastian Schelter <ss...@apache.org> wrote:

> This shouldn't be too difficult and would maybe make a good newcomer or
> student project.
>
> --sebastian

Re: Similarity between users' groups

Posted by Sebastian Schelter <ss...@apache.org>.
This reminds me of something I had in mind some weeks ago: we should invest
some work to give ItemSimilarityJob and RecommenderJob the ability to
use pluggable, customizable "sampling strategies".

This shouldn't be too difficult and would maybe make a good newcomer or
student project.

--sebastian

On 18.02.2011 18:19, Ted Dunning wrote:
> A better way to sample is to find groups with a very large number of users
> and downsample the number of users to a maximum of about 1000 (or even 200
> if you want to be more aggressive).  Do the same with users.
> 
> That won't delete a whole lot of data volume, but it will make most
> recommendation algorithms go much faster.  The idea is that after you have
> 200 or more users in a group, you aren't learning anything new anyway.


Re: Similarity between users' groups

Posted by Ted Dunning <te...@gmail.com>.
A better way to sample is to find groups with a very large number of users
and downsample the number of users to a maximum of about 1000 (or even 200
if you want to be more aggressive).  Do the same with users.

That won't delete a whole lot of data volume, but it will make most
recommendation algorithms go much faster.  The idea is that after you have
200 or more users in a group, you aren't learning anything new anyway.
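
As a rough illustration of this capping idea, the Python sketch below limits the number of memberships kept per group and then per user. The helper cap_per_key, the toy data, and the choice to load everything into memory are assumptions made for the example; the cap of 1000 is the figure suggested above.

```python
import random
from collections import defaultdict


def cap_per_key(pairs, key_index, max_per_key, seed=0):
    """Keep at most max_per_key pairs for each distinct value at key_index."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for pair in pairs:
        buckets[pair[key_index]].append(pair)
    kept = []
    for bucket in buckets.values():
        if len(bucket) > max_per_key:
            # Only oversized buckets are down-sampled; small ones pass through.
            bucket = rng.sample(bucket, max_per_key)
        kept.extend(bucket)
    return kept


# Hypothetical (user, group) memberships: one huge group and one tiny group.
pairs = [(f"user{u}", "big_group") for u in range(5000)]
pairs += [("user1", "small_group"), ("user2", "small_group")]

capped = cap_per_key(pairs, key_index=1, max_per_key=1000)   # cap users per group
capped = cap_per_key(capped, key_index=0, max_per_key=1000)  # cap groups per user
print(len(capped))
```

Unlike a constant sampling rate, this leaves small groups untouched and only trims the groups (or users) that are large enough to dominate the computation.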

On Fri, Feb 18, 2011 at 7:41 AM, Radek Maciaszek
<ra...@gmail.com> wrote:

> Each user can belong to
> many groups, so the number of combinations is rather big. In fact this
> number of combinations is so large that I am considering sampling the users and
> only analysing 1 in about 256 users. So essentially I would have 1000+
> groups and about 150k users. Since one user can potentially belong to many
> dozens of groups, this will easily go into millions of records anyway, but
> perhaps it will stay below the 100M mark you mentioned.
>

Re: Similarity between users' groups

Posted by Radek Maciaszek <ra...@gmail.com>.
Hi Sean,

Thanks for so many ideas, I will look into these. Unfortunately the amount
of data we are dealing with is quite substantial. There are 1000+ groups
and about 40 million users to analyse. Moreover, the business need is to
eventually have an even bigger number of groups. Each user can belong to
many groups, so the number of combinations is rather big. In fact this number
of combinations is so large that I am considering sampling the users and only
analysing 1 in about 256 users. So essentially I would have 1000+ groups
and about 150k users. Since one user can potentially belong to many dozens
of groups, this will easily go into millions of records anyway, but perhaps
it will stay below the 100M mark you mentioned.

Yesterday I wasn't sure if my existing cluster was big enough for this, and
now I'm tempted to try to do this on one machine. Nice one.

Cheers,
Radek

On 18 February 2011 15:13, Sean Owen <sr...@gmail.com> wrote:

> This looks like a simple collaborative filtering problem, or at least
> can be solved that way. It's not even recommendation, just an item
> similarity problem.
>
> Users are users and groups are items. You are just computing item-item
> similarity based on some metric and there are several implemented in
> the library.
>
> Forget Hadoop for now, as I doubt this is anywhere near the scale where you
> need it. For a quick solution, make a file of "userID,groupID" entries
> for every membership. Create a FileDataModel on top of it. Then
> instantiate LogLikelihoodSimilarity on top of that, for example. It
> will score the "similarity" between any two groups based on
> membership. The result is between 0 and 1.

Re: Similarity between users' groups

Posted by Sean Owen <sr...@gmail.com>.
This looks like a simple collaborative filtering problem, or at least
can be solved that way. It's not even recommendation, just an item
similarity problem.

Users are users and groups are items. You are just computing item-item
similarity based on some metric and there are several implemented in
the library.

Forget Hadoop for now, as I doubt this is anywhere near the scale where you
need it. For a quick solution, make a file of "userID,groupID" entries
for every membership. Create a FileDataModel on top of it. Then
instantiate LogLikelihoodSimilarity on top of that, for example. It
will score the "similarity" between any two groups based on
membership. The result is between 0 and 1.
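
To show the shape of the in-memory approach without depending on Mahout, here is a small Python sketch that scores every pair of groups from "userID,groupID" pairs. It deliberately swaps in the Jaccard (Tanimoto) coefficient instead of the log-likelihood measure named above, simply because it fits in a few lines and also yields a value between 0 and 1. The function names and sample data are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def load_members(pairs):
    """Build group -> set of users from (user, group) pairs."""
    members = defaultdict(set)
    for user, group in pairs:
        members[group].add(user)
    return members


def jaccard(a, b):
    """Shared users divided by all users in either group (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def all_pair_similarities(pairs):
    """Score every unordered pair of groups by Jaccard similarity."""
    members = load_members(pairs)
    return {
        (g1, g2): jaccard(members[g1], members[g2])
        for g1, g2 in combinations(sorted(members), 2)
    }


# The memberships from the original question, as (user, group) pairs.
pairs = [
    ("user1", "group1"), ("user1", "group2"),
    ("user2", "group2"),
    ("user3", "group2"), ("user3", "group3"),
]
sims = all_pair_similarities(pairs)
for (g1, g2), score in sims.items():
    print(g1, g2, round(score, 3))
```

With around 1000 groups this is roughly half a million set intersections, which is why a single machine is plausible once the user sets are capped or sampled.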

On Thu, Feb 17, 2011 at 2:34 PM, Radek Maciaszek
<ra...@gmail.com> wrote:
> Hello,
>
> I have the following problem, and I am trying to figure out whether Mahout is
> a good fit for it or whether there is a much simpler approach.
>
> Consider I have users who can belong to many groups:
> user1: group1, group2
> user2: group2
> user3: group2, group3
> ... and millions more
>
> I am trying to find similarities between the groups (not the users). A
> simple similarity metric (e.g. from 0 to 1: close to 0 for not similar at all,
> close to 1 for very similar) would be ideal. So essentially I need to calculate
> such a metric for every pair of groups.
>
> Is this something Mahout can help me with?
>
> Many thanks,
> Radek
>