Posted to user@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2010/07/16 17:51:12 UTC

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Let's clarify your situation. Are you making recommendations, or something else?
This shouldn't have anything to do with Lucene per se. You do not need Hadoop for
recommendations if you don't want it. ItemSimilarity is not related to Hadoop.
Yes, you can define whatever notion of similarity you like this way. It's
up to you, not the framework, really. But are you doing recommendations?

On Jul 16, 2010 2:01 PM, "Chantal Ackermann" <
chantal.ackermann@btelligent.de> wrote:
> Hi all,
>
> my goal is to align two slightly different categorization systems where
> each categorized item can have multiple categories in one of these
> systems.
>
> E.g.:
> Categorized item: "Harry Potter"
> Category system 1: Fiction, Fantasy, Children
> Category system 2: Youth, Fantasy
> The alignment would then produce a similarity between "Fantasy" (used in
> both systems) and "Children" (1) and Youth (2).
>
> I *think* ItemSimilarity is what I want but if anyone can provide me
> with the correct keywords for googling - that would be great.
>
> If a Lucene/SOLR index is more efficient as source than the lists I have
> I'm fine with setting that up. However, I am not sure how the schema
> would have to be structured? Would it use the categorized items as
> document entities - if not what then?
>
> Any pointers where to start would be very much appreciated! Also the
> information whether I need a full Hadoop installation or whether Mahout
> as checked out from trunk is sufficient. It is not very much data
> altogether (<10k categorized items).
>
> Thanks!
> Chantal
>
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Ted Dunning <te...@gmail.com>.
You can also turn this around and consider the classes as users and
categories as items to be recommended.

On Mon, Jul 19, 2010 at 1:27 AM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> I find Sean's suggestion on
> thinking of categories as users and using the recommendation classes for
> the task the easiest to understand, right now.
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Sean Owen <sr...@gmail.com>.
You have it right. The easiest way to deal with community 1 vs
community 2 is to pool all of the categories together into one data
model, but simply ignore most-similar categories from the wrong
community. That is, you're computing similarity between a community 1
"user" and all community 2 "users" only.

On Mon, Jul 19, 2010 at 4:27 AM, Chantal Ackermann
<ch...@btelligent.de> wrote:
> Hi Sean,
> hi Ted,
> hi Sebastian,
>
> thanks a lot for all those detailed answers. I'll need some time to
> digest the technical details, I'm afraid. I find Sean's suggestion on
> thinking of categories as users and using the recommendation classes for
> the task the easiest to understand, right now.
>
> It's not completely the same situation, though. Or only if thinking of
> two user communities, and the recommendations presented to a user of
> Community 1 should be from Community 2.
>
> (@Sebastian)
> Each item is categorized in each of the systems but it's allowed that
> the item can have zero categories. There are a few hundred categories in
> each system.
>
> The data is in lists of the following structure:
> <ITEM (ID)> [List of categories System 1] [List of categories System 2]
>
> The approach I'll take:
> 1. normalize all the category strings and give them unique number
> identifiers (unique across both systems, distinct ranges).
> 2. walk through the list and per item: extract one category (= user) and
> create a BooleanPreference for that user and item pair.
> 3. for each category (System 1) request similar categories (=user
> similarity) that are from System 2. I probably have to request a mixed
> list (both systems) and filter out the ones from System 1.
>
> I'll keep you posted. If you have more tips or things I should take
> into account - or if you think that this approach won't return any
> decent results - I'd be glad if you could share.
>
> Thanks!
> Chantal
>
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Sean Owen <sr...@gmail.com>.
Yeah that's fine. You could do this too. You're not actually making
recommendations, just computing most similar items instead of most
similar users, so lots of stuff works here.

On Mon, Jul 19, 2010 at 2:55 PM, Chantal Ackermann
<ch...@btelligent.de> wrote:
> Hi,
>
> mainly for the record:
>
> I've now mapped my items onto what in Mahout is called "User", and
> mapped the categories onto Mahout "Items", instead of mapping my items
> onto "Item" and the categories onto "User".
>
> I changed the plan because that way, it was easier to create the
> GenericBooleanPrefDataModel from my input. I actually think that it fits
> better that way - what's your opinion?
>
> The input to the data model looks a bit like this (I've shortened it for
> the sake of readability):
> [id=15901,title=Infamous] CAT1={3=Drama}
> [id=15888,title=Millions] CAT1={3=Drama, 4=Crime, 8=Thriller}
> [id=16421,title=The Departed] CAT1={3=Drama, 8=Thriller}
>
> NOTE that the data from the second category system is MISSING!
> (I have not yet accumulated all the data, but while waiting for it I am
> preparing the code to process the similarities.) It would come as an
> additional list per item:
> CAT2={<id>=<value> ...}
> Where id is in a distinctly different range from the ids used for CAT1.
>
> I am using the code from Grant Ingersoll's article:
>
> // prefs is a FastByIDMap<FastIDSet>:
> //   key   := id of the categorized item (a Mahout "user")
> //   value := FastIDSet of CAT1 (and CAT2) category ids (Mahout "items")
> // All Mahout classes used here live under org.apache.mahout.cf.taste.
> DataModel dataModel = new GenericBooleanPrefDataModel(prefs);
> ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
> ItemBasedRecommender recommender =
>        new GenericItemBasedRecommender(dataModel, itemSimilarity);
> // Get the most similar categories for each CAT1 category.
> // cat1Ids (the CAT1 category ids) and numRecs are assumed to be defined elsewhere.
> for (long id : cat1Ids) {
>        List<RecommendedItem> simItems =
>                recommender.mostSimilarItems(id, numRecs);
>        // filter out CAT1, keep only CAT2
> }
>
> I've run the code but as CAT2 is missing, currently, I am not filtering
> the results. It seems fine, from what I can tell.
>
> Thanks again for your help!
> Chantal
>
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Ted Dunning <te...@gmail.com>.
I hadn't noticed this when I sent this same suggestion just now.  I think
you are cooking.

On Mon, Jul 19, 2010 at 6:55 AM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> I changed the plan because that way, it was easier to create the
> GenericBooleanPrefDataModel from my input. I actually think that it fits
> better that way - what's your opinion?
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi,

mainly for the record:

I've now mapped my items onto what in Mahout is called "User", and
mapped the categories onto Mahout "Items", instead of mapping my items
onto "Item" and the categories onto "User".

I changed the plan because that way, it was easier to create the
GenericBooleanPrefDataModel from my input. I actually think that it fits
better that way - what's your opinion?

The input to the data model looks a bit like this (I've shortened it for
the sake of readability):
[id=15901,title=Infamous] CAT1={3=Drama}
[id=15888,title=Millions] CAT1={3=Drama, 4=Crime, 8=Thriller}
[id=16421,title=The Departed] CAT1={3=Drama, 8=Thriller}

NOTE that the data from the second category system is MISSING!
(I have not yet accumulated all the data, but while waiting for it I am
preparing the code to process the similarities.) It would come as an
additional list per item:
CAT2={<id>=<value> ...}
Where id is in a distinctly different range from the ids used for CAT1.

I am using the code from Grant Ingersoll's article:

// prefs is a FastByIDMap<FastIDSet>:
//   key   := id of the categorized item (a Mahout "user")
//   value := FastIDSet of CAT1 (and CAT2) category ids (Mahout "items")
// All Mahout classes used here live under org.apache.mahout.cf.taste.
DataModel dataModel = new GenericBooleanPrefDataModel(prefs);
ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
ItemBasedRecommender recommender =
	new GenericItemBasedRecommender(dataModel, itemSimilarity);
// Get the most similar categories for each CAT1 category.
// cat1Ids (the CAT1 category ids) and numRecs are assumed to be defined elsewhere.
for (long id : cat1Ids) {
	List<RecommendedItem> simItems =
		recommender.mostSimilarItems(id, numRecs);
	// filter out CAT1, keep only CAT2
}

I've run the code but as CAT2 is missing, currently, I am not filtering
the results. It seems fine, from what I can tell.
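
The "filter out CAT1, keep only CAT2" step in the loop could be sketched
roughly as follows. This is plain Java rather than Mahout code: Scored is a
hypothetical stand-in for RecommendedItem, and cat2MinId assumes that every
CAT2 id lies at or above a known lower bound that no CAT1 id reaches, which
is what the distinct id ranges are meant to guarantee.

```java
import java.util.ArrayList;
import java.util.List;

final class Cat2Filter {
    /** Hypothetical stand-in for a (category id, similarity score) result. */
    static final class Scored {
        final long id;
        final double score;

        Scored(long id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    /**
     * Keeps at most `limit` entries whose id is in the CAT2 range
     * (id >= cat2MinId), preserving the incoming order (best first).
     */
    static List<Scored> keepCat2(List<Scored> similar, long cat2MinId, int limit) {
        List<Scored> out = new ArrayList<>();
        for (Scored s : similar) {
            if (out.size() >= limit) {
                break; // enough CAT2 results collected
            }
            if (s.id >= cat2MinId) {
                out.add(s);
            }
        }
        return out;
    }
}
```

Since the filter shrinks the list, it is probably worth requesting more
similar items than are finally needed.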

Thanks again for your help!
Chantal


Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Chantal,

I think you're taking the right approach.

I missed the line in your first mail, where you say you have only 10k
items, so you definitely won't need hadoop or the RowSimilarityJob.

--sebastian

On 19.07.2010 10:27, Chantal Ackermann wrote:
> Hi Sean,
> hi Ted,
> hi Sebastian,
>
> thanks a lot for all those detailed answers. I'll need some time to
> digest the technical details, I'm afraid. I find Sean's suggestion on
> thinking of categories as users and using the recommendation classes for
> the task the easiest to understand, right now.
>
> It's not completely the same situation, though. Or only if thinking of
> two user communities, and the recommendations presented to a user of
> Community 1 should be from Community 2.
>
> (@Sebastian)
> Each item is categorized in each of the systems but it's allowed that
> the item can have zero categories. There are a few hundred categories in
> each system.
>
> The data is in lists of the following structure:
> <ITEM (ID)> [List of categories System 1] [List of categories System 2]
>
> The approach I'll take:
> 1. normalize all the category strings and give them unique number
> identifiers (unique across both systems, distinct ranges).
> 2. walk through the list and per item: extract one category (= user) and
> create a BooleanPreference for that user and item pair.
> 3. for each category (System 1) request similar categories (=user
> similarity) that are from System 2. I probably have to request a mixed
> list (both systems) and filter out the ones from System 1.
>
> I'll keep you posted. If you have more tips or things I should take
> into account - or if you think that this approach won't return any
> decent results - I'd be glad if you could share.
>
> Thanks!
> Chantal
>
>   



Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Sean,
hi Ted,
hi Sebastian,

thanks a lot for all those detailed answers. I'll need some time to
digest the technical details, I'm afraid. I find Sean's suggestion on
thinking of categories as users and using the recommendation classes for
the task the easiest to understand, right now.

It's not completely the same situation, though. Or only if thinking of
two user communities, and the recommendations presented to a user of
Community 1 should be from Community 2.

(@Sebastian)
Each item is categorized in each of the systems but it's allowed that
the item can have zero categories. There are a few hundred categories in
each system.

The data is in lists of the following structure:
<ITEM (ID)> [List of categories System 1] [List of categories System 2]

The approach I'll take:
1. normalize all the category strings and give them unique number
identifiers (unique across both systems, distinct ranges).
2. walk through the list and per item: extract one category (= user) and
create a BooleanPreference for that user and item pair.
3. for each category (System 1) request similar categories (=user
similarity) that are from System 2. I probably have to request a mixed
list (both systems) and filter out the ones from System 1.
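
Step 1 of the approach above might look something like the following
minimal sketch. Everything here is an assumption made for illustration: the
class name, the normalization (trim plus lower-casing), and the offset of
1,000,000 that separates the System 2 id range from the System 1 range.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class CategoryIndex {
    /** Assumed gap between the two id ranges; System 2 ids start above it. */
    static final long SYSTEM2_OFFSET = 1_000_000L;

    private final Map<String, Long> ids = new HashMap<>();
    private long nextSystem1 = 1;
    private long nextSystem2 = SYSTEM2_OFFSET + 1;

    /** Assumed normalization: strip surrounding whitespace, lower-case. */
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }

    /** Returns a stable numeric id for a category name within system 1 or 2. */
    long idFor(int system, String rawName) {
        if (system != 1 && system != 2) {
            throw new IllegalArgumentException("system must be 1 or 2");
        }
        // The same name in both systems ("Fantasy") gets two different ids.
        String key = system + ":" + normalize(rawName);
        Long existing = ids.get(key);
        if (existing != null) {
            return existing;
        }
        long id = (system == 1) ? nextSystem1++ : nextSystem2++;
        ids.put(key, id);
        return id;
    }
}
```

With ids assigned this way, the System 1 results in step 3 can be dropped
with a simple range check on the id.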

I'll keep you posted. If you have more tips or things I should take
into account - or if you think that this approach won't return any
decent results - I'd be glad if you could share.

Thanks!
Chantal


Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Sean Owen <sr...@gmail.com>.
You could construe this as a recommendation problem by thinking of
categories as "users". Then you are simply finding the pairs of users
that are most similar based on their mappings to items. Just use
LogLikelihoodSimilarity and compute all pairs of similarities.

This is a subset of a recommender problem. I imagine it is nowhere
near big enough to need Hadoop either. It's just a matter of setting
up your data in a file or something and writing about 10 lines of
code.
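
To make "similar based on their mappings to items" concrete, here is a small
plain-Java sketch (not Mahout code) of the counts that a cooccurrence-based
similarity for one pair of categories would start from. The class and method
names are made up for illustration, and the item sets are assumed to hold
the ids of the items tagged with each category.

```java
import java.util.HashSet;
import java.util.Set;

final class CooccurrenceCounts {
    /**
     * Returns {k11, k12, k21, k22} for two categories A and B:
     *   k11 = items tagged with both A and B
     *   k12 = items tagged with A but not B
     *   k21 = items tagged with B but not A
     *   k22 = items tagged with neither
     */
    static long[] contingency(Set<Long> itemsA, Set<Long> itemsB, long totalItems) {
        Set<Long> both = new HashSet<>(itemsA);
        both.retainAll(itemsB); // intersection of the two item sets
        long k11 = both.size();
        long k12 = itemsA.size() - k11;
        long k21 = itemsB.size() - k11;
        long k22 = totalItems - itemsA.size() - itemsB.size() + k11;
        return new long[] {k11, k12, k21, k22};
    }
}
```

Running this over every (System 1 category, System 2 category) pair gives
the all-pairs table; with a few hundred categories per system that is only
on the order of 10^5 pairs.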

On Fri, Jul 16, 2010 at 12:26 PM, Chantal Ackermann
<ch...@btelligent.de> wrote:
> Hi Sean,
>
> I wouldn't call it recommendations because the target audience is not
> the end user.
> I would like to do this as a first step to create a mapping between
> those two categorization systems. It's a bit like merging two datasets
> and you would like to know how similar the data in certain (multivalued)
> fields is (say field 1 and field 2).
> This would require pairing each item in field 1 with each item in field
> 2? (Matrix?)
> As a result I would expect something similar to a recommendation system,
> yes. In the sense that when I ask for a value from field 1 I would get
> the values from field 2 that could be seen most equivalent to the input
> value (with some statistical indication if possible).
>
> I've been rereading the MAHOUT-418 issue (Computing the pairwise
> similarities of the rows of a matrix) and I wonder whether this is what
> I need.
> I've also read through the hadoop word count tutorial and installed
> hadoop (which was as easy as it can be).
>
> I just don't know where to start as I have not enough experience to
> judge what is relevant for my use case.
>
> Thanks!
> Chantal
>
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Ted Dunning <te...@gmail.com>.
This is a good example of an abstract recommendation task, but I am not sure
that this is the best framework for what you want to do.

The approaches that I would use would start with two item x category
matrices that describe your two categorization systems.  Then what you want
to do is predict cooccurrence (actual or theoretical) between items of
category 1 with items of category 2.

I see at least three ways that are likely to accomplish this using Mahout.

The simplest is to use the cooccurrence counter and log-likelihood
measure to find all interesting cooccurring categories.  This will give you
pairs of categories that might or might not be from different categorization
schemes.  You could filter out the uninteresting ones and have your list of
potential pairs.
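
As a rough sketch of the log-likelihood measure mentioned above: for a pair
of categories, count the items that carry both (k11), only the first (k12),
only the second (k21), and neither (k22), then compute the G^2 statistic
from that 2x2 table. The class and method names below are illustrative only
and are not meant to mirror Mahout's API.

```java
final class LlrSketch {
    private LlrSketch() {}

    /** x * ln(x), using the convention 0 * ln(0) = 0. */
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a list of counts: N ln N - sum(k ln k). */
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long k : counts) {
            sum += k;
            parts += xLogX(k);
        }
        return xLogX(sum) - parts;
    }

    /** G^2 = 2 * (H(rows) + H(columns) - H(cells)) for a 2x2 table. */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double cellEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + colEntropy - cellEntropy));
    }
}
```

A score near zero means the two categories cooccur about as often as chance
would predict; larger scores mark pairs worth looking at. Note that the
statistic is also large for pairs that cooccur much less than expected, so
checking that k11 is above its expected value is a sensible extra filter.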

Another approach would be to use frequent itemset mining with the same goal
as the first approach.

From there, I would move to latent variable techniques.  The idea is that
you should be able to describe your items and your categories in terms of
internal variables.  Similarity of internal representation should be
something like what you need.  The two major systems that produce latent
variable representations available in Mahout are SVD and LDA.  With SVD, you
define a matrix A that is the column-wise concatenation of the two item x
category matrices I mentioned above (the two placed side by side).  When you
decompose this you will get left and right singular vectors (U and V) and a
diagonal matrix D such that

    A \approx U D V'

Now V will have as many rows as the sum of the numbers of categories of both
types.  You can, in fact, split V into parts corresponding to the two types
of categories.  This will give you

    V = [ V_1 ]
        [ V_2 ]

You should be able to use the dot product of rows of V_1 versus the rows of
V_2 to get the similarity you want.  You may want to normalize the rows of V
before doing these dot products.  You can do this entire set of dot products
at once using V_1 V_2' but if you have massive numbers of categories that is
probably a bit of over-kill.
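
The normalized dot products described above can be sketched in a few lines
of plain Java. This assumes the SVD has already been computed elsewhere and
that V_1 and V_2 are available as dense arrays with one row per category;
the class and method names are invented for this example.

```java
final class LatentRowSimilarity {
    private LatentRowSimilarity() {}

    /** Returns a unit-length copy of row (all zeros if the row has zero norm). */
    static double[] normalize(double[] row) {
        double sumSq = 0.0;
        for (double x : row) {
            sumSq += x * x;
        }
        double norm = Math.sqrt(sumSq);
        double[] out = new double[row.length];
        if (norm == 0.0) {
            return out;
        }
        for (int i = 0; i < row.length; i++) {
            out[i] = row[i] / norm;
        }
        return out;
    }

    /** sim[i][j] = (normalized row i of v1) . (normalized row j of v2). */
    static double[][] crossSimilarity(double[][] v1, double[][] v2) {
        double[][] unit2 = new double[v2.length][];
        for (int j = 0; j < v2.length; j++) {
            unit2[j] = normalize(v2[j]);
        }
        double[][] sim = new double[v1.length][v2.length];
        for (int i = 0; i < v1.length; i++) {
            double[] a = normalize(v1[i]);
            for (int j = 0; j < v2.length; j++) {
                double dot = 0.0;
                for (int d = 0; d < a.length; d++) {
                    dot += a[d] * unit2[j][d];
                }
                sim[i][j] = dot;
            }
        }
        return sim;
    }
}
```

For each row i, the columns with the largest values are the System 2
categories closest to System 1 category i in the latent space.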

Using LDA would be quite similar to this except that the decomposition is
not in terms of a matrix product per se.  You would still come out with the
equivalent of the rows of V for each category and dot products would still
make sense.

On Fri, Jul 16, 2010 at 9:26 AM, Chantal Ackermann <
chantal.ackermann@btelligent.de> wrote:

> Hi Sean,
>
> I wouldn't call it recommendations because the target audience is not
> the end user.
> I would like to do this as a first step to create a mapping between
> those two categorization systems. It's a bit like merging two datasets
> and you would like to know how similar the data in certain (multivalued)
> fields is (say field 1 and field 2).
> This would require pairing each item in field 1 with each item in field
> 2? (Matrix?)
> As a result I would expect something similar to a recommendation system,
> yes. In the sense that when I ask for a value from field 1 I would get
> the values from field 2 that could be seen most equivalent to the input
> value (with some statistical indication if possible).
>
> I've been rereading the MAHOUT-418 issue (Computing the pairwise
> similarities of the rows of a matrix) and I wonder whether this is what
> I need.
> I've also read through the hadoop word count tutorial and installed
> hadoop (which was as easy as it can be).
>
> I just don't know where to start as I have not enough experience to
> judge what is relevant for my use case.
>
> Thanks!
> Chantal
>
>

Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Sebastian Schelter <ss...@googlemail.com>.
Hi Chantal,

Can you give us more detailed information about your data? Is each item
categorized in each of the two category systems?

If you can think of a way to vectorize your data, I could maybe help you
find a way to implement a custom DistributedVectorSimilarity so you can
use the RowSimilarityJob, but we'd have to make sure your problem can
really be modeled like this.

--sebastian


On 16.07.2010 18:26, Chantal Ackermann wrote:
> Hi Sean,
>
> I wouldn't call it recommendations because the target audience is not
> the end user.
> I would like to do this as a first step to create a mapping between
> those two categorization systems. It's a bit like merging two datasets
> and you would like to know how similar the data in certain (multivalued)
> fields is (say field 1 and field 2).
> This would require pairing each item in field 1 with each item in field
> 2? (Matrix?)
> As a result I would expect something similar to a recommendation system,
> yes. In the sense that when I ask for a value from field 1 I would get
> the values from field 2 that could be seen most equivalent to the input
> value (with some statistical indication if possible).
>
> I've been rereading the MAHOUT-418 issue (Computing the pairwise
> similarities of the rows of a matrix) and I wonder whether this is what
> I need.
> I've also read through the hadoop word count tutorial and installed
> hadoop (which was as easy as it can be).
>
> I just don't know where to start as I have not enough experience to
> judge what is relevant for my use case.
>
> Thanks!
> Chantal
>
>


Re: Cooccurrence to align different categorization systems (many to many occurrence)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Sean,

I wouldn't call it recommendations because the target audience is not
the end user.
I would like to do this as a first step to create a mapping between
those two categorization systems. It's a bit like merging two datasets
and you would like to know how similar the data in certain (multivalued)
fields is (say field 1 and field 2).
This would require pairing each item in field 1 with each item in field
2? (Matrix?)
As a result I would expect something similar to a recommendation system,
yes. In the sense that when I ask for a value from field 1 I would get
the values from field 2 that could be seen most equivalent to the input
value (with some statistical indication if possible).

I've been rereading the MAHOUT-418 issue (Computing the pairwise
similarities of the rows of a matrix) and I wonder whether this is what
I need.
I've also read through the hadoop word count tutorial and installed
hadoop (which was as easy as it can be).

I just don't know where to start as I have not enough experience to
judge what is relevant for my use case.

Thanks!
Chantal

