Posted to dev@mahout.apache.org by Dhruv Kumar <dk...@ecs.umass.edu> on 2011/09/10 21:06:03 UTC

Ehcache and Mahout

Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?

http://ehcache.org/

For iterative MapReduce applications running on a NoSQL data store, it
should provide a good performance boost by providing an in-memory object
cache (I think). Any comments?

Re: Ehcache and Mahout

Posted by Marko Ciric <ci...@gmail.com>.

Actually, there is a whole set of "generic" classes that do caching by
using FastMap classes (you can check out the Mahout source from the Apache
repo).
These implementations actually give you the same effect as EhCache by
holding all data in memory.

The drawback of relying only on Mahout's on-heap caching is that the data is
loaded while these objects are constructed, not incrementally as it can be
with EhCache. If you are not going to do distributed calculations with
MapReduce algorithms, you'll need caching to speed things up.
If your data isn't too big and fits comfortably in the JVM heap, you can use
Mahout without EhCache. If you can't load all the data at once, you should
implement your own caching (this is possible with EhCache itself) and take
care yourself that you don't run out of memory.
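
As a rough illustration of "your own caching" with a memory bound, here is a
minimal sketch built only on the JDK's LinkedHashMap. The class name
BoundedLruCache and the entry limit are assumptions made for this example;
this is neither Mahout's FastMap nor Ehcache's API, just one simple way to
keep an in-heap cache from growing without limit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache; not part of Mahout or Ehcache. */
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        // accessOrder = true: entries are ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; evicts the least recently used entry once over the limit.
        return size() > maxEntries;
    }
}
```

Entries are loaded incrementally with put() as they are first needed, and the
least recently used entry is dropped once the limit is exceeded. The limit
counts entries, not bytes, so it has to be sized against the heap by hand.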

On 11 September 2011 07:32, Ted Dunning <te...@gmail.com> wrote:

> Caching in-process like this is likely to have much more satisfactory
> results than an external caching process.  Also, caching structures with
> repetitive access patterns is obviously better than caching single access
> data.  Thus caching small side data works well.  Map inputs do not.
>
> On Sat, Sep 10, 2011 at 6:28 PM, Robin Anil <ro...@gmail.com> wrote:
>
> > I once wrote a simple cache for HBaseDatastore in naive Bayes classifier
> > package and yes the speedup was really awesome, weights of high freq words
> > got cached and incremental lookup for rest of the words in a document was
> > really low. I had posted numbers on the old JIRA ticket
> >  On Sep 11, 2011 12:36 AM, "Dhruv Kumar" <dk...@ecs.umass.edu> wrote:
> > > Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
> > >
> > > http://ehcache.org/
> > >
> > > For iterative MapReduce applications running on a NoSQL data store, it
> > > should provide a good performance boost by providing an in-memory object
> > > cache (I think). Any comments?
> >
>

-- 
--
Marko Ćirić
ciric.marko@gmail.com

Re: Recommendation with a dataset with no/same preference

Posted by Sean Owen <sr...@gmail.com>.

This is small enough that you can fit this into memory on one machine,
and you do not need Hadoop.

I would simply start with a GenericBooleanPrefItemBasedRecommender,
and attach it to a LogLikelihoodSimilarity similarity metric. Wrap the
LogLikelihoodSimilarity in a CachingItemSimilarity. You can feed your
associations in any way you want, but one easy way is as a CSV file of
"userID,itemID" and a FileDataModel.

This ought to work pretty well for you, but is just a starting point.
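
For readers wondering what a log-likelihood similarity computes on data with
no ratings, here is a stdlib-only sketch of the log-likelihood ratio test on
a 2x2 co-occurrence table of purchases. The class name LlrSketch, its method
names, and the mapping of the ratio into [0, 1) are assumptions made for this
illustration; this is not Mahout's source code.

```java
/** Hypothetical helper: log-likelihood ratio on a 2x2 co-occurrence table. */
final class LlrSketch {
    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy of a list of counts. */
    private static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /**
     * k11: users who bought both item A and item B; k12: A but not B;
     * k21: B but not A; k22: neither.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp tiny negative values caused by floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    /** Maps the unbounded ratio into [0, 1); higher means a stronger association. */
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }
}
```

Only the counts of who bought what enter the computation, which is why it
suits "userID,itemID" data with no preference values. Because the same item
pairs are compared over and over, memoizing the result per pair is the kind
of saving a caching wrapper around the similarity provides.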

On Sun, Sep 11, 2011 at 6:01 PM, Manju <ma...@yahoo.com> wrote:
> Dear Mahout team,
>
> Need some advice. The books "Mahout/Hadoop in Action" and online information have helped me digest the basic concepts and set up a single-node Hadoop + Mahout installation (run examples, write test programs, build, etc.).
>
> I am prototyping a solution for an analytics problem using a user/item recommender structure (to start with). I have a list of 300 thousand users who have bought (on average) 10 items from a finite set of 300 items. I don't have individual preferences for each item bought. As the items are expensive, require pre-purchase research, and have very low complaint/return rates, I am assuming that users liked the items they bought (for the first iteration, until I get more sophisticated data).
>
> Any advice on how best to approach the scenario with item or user based recommendation (given the lack of spread in ratings/preferences)?
>
> Appreciate your advice.
> Manju
>

Re: Recommendation with a dataset with no/same preference

Posted by Ted Dunning <te...@gmail.com>.

Good luck.

Let us know how it turns out.


On Sun, Sep 11, 2011 at 2:55 PM, Manju <ma...@yahoo.com> wrote:

> Ted and Sean,
> Thanks for the suggestion/advice. My prototype ran successfully
> (programmatically:) with GenericBooleanPrefItemBasedRecommender. I am
> reviewing/reflecting on the output.
> Thanks again.
> Manju
>
> ------------------------------
> *From:* Ted Dunning <te...@gmail.com>
> *To:* user@mahout.apache.org; Manju <ma...@yahoo.com>
> *Cc:* "dev@mahout.apache.org" <de...@mahout.apache.org>
> *Sent:* Sunday, September 11, 2011 3:55 PM
> *Subject:* Re: Recommendation with a dataset with no/same preference
>
> Binary preferences are fine.
>
> In fact, I generally recommend that all ratings and related information be
> distilled down to a single binary indicator such as you already have.
>
> The fact that you have so few items will be both your advantage and
> disadvantage.  It will help you avoid problems with sparsity and lack of
> overlap between users, but it will also make your life harder because there
> aren't so many items to recommend.  This will be exacerbated by your
> customers' tendency to exhaustively research items before purchase ... it is
> likely that they will know about most related items already.
>
> On Sun, Sep 11, 2011 at 10:01 AM, Manju <ma...@yahoo.com> wrote:
>
> ... have purchase data but not rating data ...
>
>
>
> Any advice on how best to approach the scenario with item or user based
> recommendation (given the lack of spread in ratings/preferences)?
>
>
>
>

Re: Recommendation with a dataset with no/same preference

Posted by Manju <ma...@yahoo.com>.

Ted and Sean,
Thanks for the suggestion/advice. My prototype ran successfully (programmatically:) with GenericBooleanPrefItemBasedRecommender. I am reviewing/reflecting on the output.

Thanks again.

Manju

________________________________
From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org; Manju <ma...@yahoo.com>
Cc: "dev@mahout.apache.org" <de...@mahout.apache.org>
Sent: Sunday, September 11, 2011 3:55 PM
Subject: Re: Recommendation with a dataset with no/same preference

Binary preferences are fine.

In fact, I generally recommend that all ratings and related information be distilled down to a single binary indicator such as you already have.

The fact that you have so few items will be both your advantage and disadvantage.  It will help you avoid problems with sparsity and lack of overlap between users, but it will also make your life harder because there aren't so many items to recommend.  This will be exacerbated by your customers' tendency to exhaustively research items before purchase ... it is likely that they will know about most related items already.

On Sun, Sep 11, 2011 at 10:01 AM, Manju <ma...@yahoo.com> wrote:

... have purchase data but not rating data ...

Any advice on how best to approach the scenario with item or user based recommendation (given the lack of spread in ratings/preferences)?

Re: Recommendation with a dataset with no/same preference

Posted by Ted Dunning <te...@gmail.com>.

Binary preferences are fine.

In fact, I generally recommend that all ratings and related information be
distilled down to a single binary indicator such as you already have.

The fact that you have so few items will be both your advantage and
disadvantage.  It will help you avoid problems with sparsity and lack of
overlap between users, but it will also make your life harder because there
aren't so many items to recommend.  This will be exacerbated by your
customers' tendency to exhaustively research items before purchase ... it is
likely that they will know about most related items already.

On Sun, Sep 11, 2011 at 10:01 AM, Manju <ma...@yahoo.com> wrote:

> ... have purchase data but not rating data ...
>

> Any advice on how best to approach the scenario with item or user based
> recommendation (given the lack of spread in ratings/preferences)?
>
>

Recommendation with a dataset with no/same preference

Posted by Manju <ma...@yahoo.com>.

Dear Mahout team,

Need some advice. The books "Mahout/Hadoop in Action" and online information have helped me digest the basic concepts and set up a single-node Hadoop + Mahout installation (run examples, write test programs, build, etc.).

I am prototyping a solution for an analytics problem using a user/item recommender structure (to start with). I have a list of 300 thousand users who have bought (on average) 10 items from a finite set of 300 items. I don't have individual preferences for each item bought. As the items are expensive, require pre-purchase research, and have very low complaint/return rates, I am assuming that users liked the items they bought (for the first iteration, until I get more sophisticated data).

Any advice on how best to approach the scenario with item or user based recommendation (given the lack of spread in ratings/preferences)?

Appreciate your advice.
Manju

Re: Ehcache and Mahout

Posted by Ted Dunning <te...@gmail.com>.

Caching in-process like this is likely to have much more satisfactory
results than an external caching process.  Also, caching structures with
repetitive access patterns is obviously better than caching single access
data.  Thus caching small side data works well.  Map inputs do not.
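
To make the difference between repeated-access side data and single-access
map inputs concrete, here is a small sketch of an in-process read-through
cache that counts hits and misses. The class CountingCache, its loader
function, and the sample keys are all hypothetical and exist only for this
illustration; the loader stands in for a slow lookup against an external
store and is assumed never to return null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical read-through cache that counts hits and misses. */
final class CountingCache<K, V> {
    private final Map<K, V> store = new HashMap<>();
    private final Function<K, V> loader; // stands in for a slow external lookup
    private long hits;
    private long misses;

    CountingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    V get(K key) {
        V value = store.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key); // assumed to return a non-null value
        store.put(key, value);
        return value;
    }

    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // Side data: a small vocabulary that is looked up repeatedly.
        CountingCache<String, Double> side = new CountingCache<>(w -> (double) w.length());
        for (String w : new String[] {"the", "of", "the", "and", "the", "of"}) {
            side.get(w);
        }

        // Map input: every record is read exactly once.
        CountingCache<Integer, Integer> input = new CountingCache<>(i -> i * 2);
        for (int i = 0; i < 6; i++) {
            input.get(i);
        }

        System.out.println("side data hit rate: " + side.hitRate());
        System.out.println("map input hit rate: " + input.hitRate());
    }
}
```

With the repeated vocabulary, half of the six lookups are served from the
cache; with keys that are each read once, every lookup is a miss and the
cache only adds memory overhead.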

On Sat, Sep 10, 2011 at 6:28 PM, Robin Anil <ro...@gmail.com> wrote:

> I once wrote a simple cache for HBaseDatastore in naive Bayes classifier
> package and yes the speedup was really awesome, weights of high freq words
> got cached and incremental lookup for rest of the words in a document was
> really low. I had posted numbers on the old JIRA ticket
>  On Sep 11, 2011 12:36 AM, "Dhruv Kumar" <dk...@ecs.umass.edu> wrote:
> > Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
> >
> > http://ehcache.org/
> >
> > For iterative MapReduce applications running on a NoSQL data store, it
> > should provide a good performance boost by providing an in-memory object
> > cache (I think). Any comments?
>

Re: Ehcache and Mahout

Posted by Robin Anil <ro...@gmail.com>.

I once wrote a simple cache for HBaseDatastore in the naive Bayes classifier
package, and yes, the speedup was really awesome: weights of high-frequency
words got cached, and the incremental lookup cost for the rest of the words
in a document was really low. I had posted numbers on the old JIRA ticket.
 On Sep 11, 2011 12:36 AM, "Dhruv Kumar" <dk...@ecs.umass.edu> wrote:
> Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
>
> http://ehcache.org/
>
> For iterative MapReduce applications running on a NoSQL data store, it
> should provide a good performance boost by providing an in-memory object
> cache (I think). Any comments?

Re: Ehcache and Mahout

Posted by Sean Owen <sr...@gmail.com>.

This is more a question of how you store your input to Hadoop... it's
not directly tied to Mahout I think.

NoSQL data stores are good at fast random access. The Hadoop model for
input is much more about sequential reads. So you can read from
Cassandra for sure, but Cassandra's nice properties aren't really being
used in that case.

If anything, Ehcache would only help speed up random access, which
would not really help here.

I can think of several uses for Ehcache but this might not quite be
it. For example -- many M/Rs 'cheat' by trying to cache and read side
information for performance. You can bet it would be useful there.

On Sat, Sep 10, 2011 at 8:43 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
> Well, my understanding was that Ehcache allows name-value pairs to be stored
> in-memory, reducing disk transactions. So, if I put Ehcache on top of a
> NoSQL persistence store such as Cassandra which is also a key-value store,
> it should speed up the performance of a MapReduce app.
>
> On Sat, Sep 10, 2011 at 3:32 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> What are you thinking it might cache?
>>
>> On Sat, Sep 10, 2011 at 8:06 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
>> > Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
>> >
>> > http://ehcache.org/
>> >
>> > For iterative MapReduce applications running on a NoSQL data store, it
>> > should provide a good performance boost by providing an in-memory object
>> > cache (I think). Any comments?
>> >
>>
>

Re: Ehcache and Mahout

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.

Well, my understanding was that Ehcache allows name-value pairs to be stored
in-memory, reducing disk transactions. So, if I put Ehcache on top of a
NoSQL persistence store such as Cassandra which is also a key-value store,
it should speed up the performance of a MapReduce app.

On Sat, Sep 10, 2011 at 3:32 PM, Sean Owen <sr...@gmail.com> wrote:

> What are you thinking it might cache?
>
> On Sat, Sep 10, 2011 at 8:06 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
> > Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
> >
> > http://ehcache.org/
> >
> > For iterative MapReduce applications running on a NoSQL data store, it
> > should provide a good performance boost by providing an in-memory object
> > cache (I think). Any comments?
> >
>

Re: Ehcache and Mahout

Posted by Sean Owen <sr...@gmail.com>.

What are you thinking it might cache?

On Sat, Sep 10, 2011 at 8:06 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
> Has anyone over here used EHcache with Mahout (or pure Hadoop jobs)?
>
> http://ehcache.org/
>
> For iterative MapReduce applications running on a NoSQL data store, it
> should provide a good performance boost by providing an in-memory object
> cache (I think). Any comments?
>
