Posted to dev@mahout.apache.org by 王建国 <jo...@gmail.com> on 2014/10/14 08:43:19 UTC

How to build a recommendation system based on Mahout serving millions or even billions of users?

Hi Owen and all,
    I am a developer from China. I am building a recommendation system
based on Mahout version 0.9. Since the user IDs and item IDs are strings,
I need to map them to long. But I found that there is a long-to-int mapping
provided by the function "int TasteHadoopUtils.idToIndex(long)".
Considering there may be millions or even billions of users, I wonder
whether many longs can be mapped to the same int. If so, that would do a
lot of harm.
This is quite confusing. What solution should I choose in this
situation? Meanwhile, I read your answer, quoted below. Could you please
tell me which data structure indexed by long you use in Myrrix? Thanks in
advance.
wangjiangwei
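
The string-to-long step mentioned above can be sketched as follows. This is
only an illustration under stated assumptions: the class StringIdMapper and
its methods are hypothetical names, not Mahout API, and the choice of MD5
truncated to 8 bytes is an assumption of this sketch. Mahout's Taste API
has, as far as I recall, an IDMigrator interface intended for this purpose;
the sketch does not depend on it and uses only the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Mahout: maps string IDs to longs and keeps the reverse mapping. */
final class StringIdMapper {
    // Reverse dictionary so that recommendations can be translated back to string IDs.
    private final Map<Long, String> reverse = new HashMap<>();

    /** Hashes the string with MD5 and keeps the first 8 bytes of the digest as a long. */
    static long toLongId(String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    /** Computes the long ID, records the reverse mapping, and fails loudly on a 64-bit collision. */
    long register(String id) {
        long longId = toLongId(id);
        String previous = reverse.putIfAbsent(longId, id);
        if (previous != null && !previous.equals(id)) {
            throw new IllegalStateException("64-bit collision between " + previous + " and " + id);
        }
        return longId;
    }

    /** Returns the original string ID, or null if the long ID was never registered. */
    String toStringId(long longId) {
        return reverse.get(longId);
    }

    public static void main(String[] args) {
        StringIdMapper mapper = new StringIdMapper();
        long a = mapper.register("user-alice");
        long b = mapper.register("user-bob");
        System.out.println(a + " -> " + mapper.toStringId(a));
        System.out.println(b + " -> " + mapper.toStringId(b));
    }
}
```

With 64 bits, collisions between string IDs are very unlikely even for
billions of IDs. The in-memory reverse map is only for illustration; at that
scale it would have to live in an external store.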

Question:
I have read some code about item-based recommendation in version 0.6,
starting from "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob".
I found that there is a long-to-int mapping
provided by the function "int TasteHadoopUtils.idToIndex(long)".
The long-to-int mapping is performed on both userId and itemId. I wonder if
it is possible for two longs to be mapped to one int? If so, then we would
likely merge vectors from different item IDs/user IDs, right? This is quite
confusing.
Would it be better to provide a RandomAccessSparseVector implemented with
OpenLongDoubleHashMap instead of OpenIntDoubleHashMap? Thanks in advance.
Wei Feng
Answer:
    That's right. It ought to be uncommon but can happen. For recommenders,
it "only" means that you start to treat two users or two items as the same
thing. That doesn't do much harm though. Maybe one user's recs are a little
funny.
I do think it would have been useful to index by long, but that would have
significantly increased memory requirements too.
(In developing Myrrix I have switched to use a data structure indexed by
long though, because it becomes more necessary to avoid the mapping.)
Sean Owen
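
To make the collision discussed above concrete, here is a minimal sketch of
a long-to-int fold. It is an assumption about the general shape of such a
mapping (XOR the two 32-bit halves, clear the sign bit); the actual body of
TasteHadoopUtils.idToIndex may differ between Mahout versions. The class
LongToIntFold and both of its methods are hypothetical names. The second
method estimates how many IDs lose their unique index, assuming the IDs
behave like uniformly random values over 2^31 slots. Sequential IDs below
2^31 would not collide at all under this fold.

```java
/** Hypothetical illustration; the real TasteHadoopUtils.idToIndex may differ between Mahout versions. */
final class LongToIntFold {

    /** Folds a 64-bit ID into a non-negative 31-bit index: XOR of the two halves, sign bit cleared. */
    static int foldToIndex(long id) {
        int hash = (int) (id ^ (id >>> 32));
        return hash & 0x7FFFFFFF;
    }

    /** Expected number of distinct indexes when n uniformly random IDs fall into m slots. */
    static double expectedDistinct(double n, double m) {
        // m * (1 - e^(-n/m)); expm1 keeps precision when n/m is small.
        return m * -Math.expm1(-n / m);
    }

    public static void main(String[] args) {
        // Two different IDs that fold to the same index.
        System.out.println(foldToIndex(1L));        // 1
        System.out.println(foldToIndex(1L << 32));  // 1

        double slots = Math.pow(2, 31);
        for (double n : new double[] {1e6, 1e8, 1e9}) {
            double lost = n - expectedDistinct(n, slots);
            System.out.printf("%.0f IDs: about %.0f do not get an index of their own%n", n, lost);
        }
    }
}
```

Under the uniformity assumption, the arithmetic suggests a few hundred
affected IDs out of a million, which fits "uncommon", but roughly a fifth of
the IDs at a billion, which is the scale the original poster is worried
about.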

Re: How to build a recommendation system based on Mahout serving millions or even billions of users?

Posted by 王建国 <jo...@gmail.com>.

Hi Ted,
       I don't know why I can't download the book. Maybe my network
connection is very poor. Could you send it to me? I am looking forward to
reading it. Thanks.

2014-10-15 5:47 GMT+08:00 Ted Dunning <te...@gmail.com>:

> You should move forward to version 0.9.
>
> Take a look at more recent methods in this book:
>
> https://www.mapr.com/practical-machine-learning

Re: How to build a recommendation system based on Mahout serving millions or even billions of users?

Posted by 王建国 <jo...@gmail.com>.

Hi Ted!
      Thank you for recommending the book. I have looked through it. It
mainly describes the infrastructure of a recommendation system. But my
problem is how to generate a contiguous set of integer indexes for
the RowSimilarityJob. If I use a hash algorithm like MD5, many of the
integer indexes would be repeated. That is terrible when there are millions
or even billions of users and items. I did not find a concrete solution in
the book. It just says:
      The logs used the artist, track, and album IDs from the Postgres copy
of the MusicBrainz data. These IDs are, however, not suitable directly for
use with the RowSimilarityJob from Mahout since that program requires that
all IDs be converted to a contiguous set of integer indexes. In our Music
Machine recommender, this conversion was done using a Pig program that
produced two outputs. The first output from Pig is input for the
RowSimilarityJob, and the second output is a dictionary that records the
mapping from the original and the Mahout versions of the IDs.
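
The dictionary idea in the quoted passage can be sketched in a few lines.
This is not the Pig program from the book; it is a hypothetical in-memory
stand-in (the class ContiguousIndexDictionary and the sample log entries are
made up for illustration) that shows why a dictionary avoids the repeated
indexes a hash would produce: each new ID simply receives the next unused
integer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the dictionary step; not the Pig program from the book. */
final class ContiguousIndexDictionary {
    private final Map<String, Integer> idToIndex = new LinkedHashMap<>();
    private final List<String> indexToId = new ArrayList<>();

    /** Returns the existing index for an ID, or assigns the next unused one (0, 1, 2, ...). */
    int indexOf(String originalId) {
        Integer existing = idToIndex.get(originalId);
        if (existing != null) {
            return existing;
        }
        int index = indexToId.size();
        idToIndex.put(originalId, index);
        indexToId.add(originalId);
        return index;
    }

    /** Looks up the original ID for an index handed out earlier. */
    String originalId(int index) {
        return indexToId.get(index);
    }

    int size() {
        return indexToId.size();
    }

    public static void main(String[] args) {
        ContiguousIndexDictionary users = new ContiguousIndexDictionary();
        ContiguousIndexDictionary items = new ContiguousIndexDictionary();
        // Made-up log entries: (user ID, item ID).
        String[][] log = {{"alice", "track-9f3"}, {"bob", "track-9f3"}, {"alice", "album-77"}};

        // First output: index pairs in the contiguous form the similarity job needs.
        for (String[] event : log) {
            System.out.println(users.indexOf(event[0]) + "\t" + items.indexOf(event[1]));
        }
        // Second output: the dictionary from index back to original item ID.
        for (int i = 0; i < items.size(); i++) {
            System.out.println(i + "\t" + items.originalId(i));
        }
    }
}
```

Because indexes are assigned rather than hashed, no two IDs ever share one,
and an int is enough for about 2.1 billion distinct IDs per dictionary. A
single in-memory map will not hold billions of entries, so at that scale the
same assignment would have to be done by a distributed job, as the book does
with Pig.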

2014-10-15 5:47 GMT+08:00 Ted Dunning <te...@gmail.com>:

> You should move forward to version 0.9.
>
> Take a look at more recent methods in this book:
>
> https://www.mapr.com/practical-machine-learning

Re: How to build a recommendation system based on Mahout serving millions or even billions of users?

Posted by 王建国 <jo...@gmail.com>.

Thank you very much! It is version 0.9 that I am learning now. I will read
the book as you advised.

2014-10-15 5:47 GMT+08:00 Ted Dunning <te...@gmail.com>:

> You should move forward to version 0.9.
>
> Take a look at more recent methods in this book:
>
> https://www.mapr.com/practical-machine-learning

Re: How to build a recommendation system based on Mahout serving millions or even billions of users?

Posted by Ted Dunning <te...@gmail.com>.

You should move forward to version 0.9.

Take a look at more recent methods in this book:

https://www.mapr.com/practical-machine-learning


