Posted to user@mahout.apache.org by Vinicius Carvalho <vi...@gmail.com> on 2010/01/30 03:01:29 UTC

Getting Taste to work on 10M dataset

Hello there! I'm trying to get Taste to work on the 10M dataset but, even
following some tips from Sean on Mahout in Action, I can't get it working
using SlopeOneRecommender and JDBC.

The machine I run the examples on is a Core 2 Duo 2.8 GHz with 4 GB RAM, running Ubuntu
9.10 64-bit and JDK 1.6.

MySQL is set to use up to 512 MB of table cache.

The JVM is running with -Xmx2048m.


I'm using Spring to make things simpler, but the bottom line is that I create a
SlopeOneRecommender using the constructor:

DataModel:JDBCDataModel
weighting:Weighted
stdDevWeighting:Weighted
diffStorage:memory

The memory storage is configured:

DataModel:JDBCDataModel
weighting:Weighted
compact:false
maxEntries:100000
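
For reference, that wiring corresponds to roughly the following plain-Java setup. This is only a sketch against the Taste API of that era; the `dataSource` variable and the exact constructor overloads are assumptions to check against your Mahout version:

```java
// Sketch only -- assumes a pooled javax.sql.DataSource named dataSource
// pointing at the MySQL instance, and the 0.x-era Taste classes from
// org.apache.mahout.cf.taste.
DataModel model = new MySQLJDBCDataModel(dataSource);

// Mirrors the settings above: Weighted, compact=false, maxEntries=100000
DiffStorage diffStorage =
    new MemoryDiffStorage(model, Weighting.WEIGHTED, false, 100000L);

Recommender recommender =
    new SlopeOneRecommender(model, Weighting.WEIGHTED, Weighting.WEIGHTED,
                            diffStorage);
```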


Running the code I get this exception:

Caused by: java.lang.NullPointerException
    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2843)
    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2830)
    at
org.apache.commons.dbcp.DelegatingResultSet.getLong(DelegatingResultSet.java:190)
    at
org.apache.mahout.cf.taste.impl.model.jdbc.AbstractJDBCDataModel.getLongColumn(AbstractJDBCDataModel.java:602)
    at
org.apache.mahout.cf.taste.impl.model.jdbc.AbstractJDBCDataModel$ResultSetIDIterator.nextLong(AbstractJDBCDataModel.java:677)
    at
org.apache.mahout.cf.taste.impl.recommender.slopeone.MemoryDiffStorage.buildAverageDiffs(MemoryDiffStorage.java:221)
    at
org.apache.mahout.cf.taste.impl.recommender.slopeone.MemoryDiffStorage.<init>(MemoryDiffStorage.java:115)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at
org.springframework.beans.BeanUtils.instantiateClass(BeanUtils.java:100)
    ... 57 more

I tried replacing the MemoryDiffStorage with a JDBCDiffStorage. After almost an hour of
running the code, with 100% CPU usage from my mysql process, I decided to quit. Is it
supposed to take that long?

I tried changing maxEntries to a smaller value, but the NullPointerException
always happens. My table is not exactly like the sample one in the
source code, but I do pass in the correct column names and they are of the same
types.

Any ideas?





-- 
The intuitive mind is a sacred gift and the
rational mind is a faithful servant. We have
created a society that honors the servant and
has forgotten the gift.

Re: Getting Taste to work on 10M dataset

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Jan 30, 2010 at 2:50 AM, Vinicius Carvalho
<vi...@gmail.com> wrote:
> Checked the DB, there's no null columns. Also, the table was built using the
> sample CREATE TABLE provided at the source code

Yes but I was referring to your diffs table, how about that?

Re: Getting Taste to work on 10M dataset

Posted by Vinicius Carvalho <vi...@gmail.com>.
On Sat, Jan 30, 2010 at 12:40 AM, Sean Owen <sr...@gmail.com> wrote:

> On Sat, Jan 30, 2010 at 2:34 AM, Vinicius Carvalho
> <vi...@gmail.com> wrote:
> > I'm trying the 5.1.10 the latest one available at maven repositories,
> > running it right now, since it takes a while, I'll inform of the results
>
> OK but this would be something you can check in your table right now.
> No columns should be nullable, or have nulls. If they do, that's the
> problem.
>
Checked the DB, there's no null columns. Also, the table was built using the
sample CREATE TABLE provided at the source code:

CREATE TABLE taste_preferences (
  user_id BIGINT NOT NULL,
  item_id BIGINT NOT NULL,
  preference FLOAT NOT NULL,
  PRIMARY KEY (user_id, item_id),
  INDEX (user_id),
  INDEX (item_id)
)

>
> > At first I'm just creating the slopeonerecommender. did not even get to
> the
> > actual code, all that time is used on the construction of the object
>
> OK then it's the time spent in building diffs.
>
>
> > You mean for the DiffStorage right? The datamodel would be good to be at
> > JDBC right? I'm interested in item2item recommendations. I did this
> before
>
> For both, 10M ratings isn't terribly big. I think you can get it into
> memory in 2GB, plus the diffs, if you cap the number of diffs at some
> reasonable value.
>
> Only problem with in memory for datamodel would be volatility I guess.


>  > using taste by hand by computing the SimilarityMatrix and storing it on
> DB.
> > (I used as reference the book Collective Intelligence in action) and it
> > worked fine. Just the Similarity Matrix took a while to be recalculated
> by
> > it was a batch job running every hour. After that computing
> recomendations
> > was just a breeze.
>
> You mean you are interested in item-based recommenders, or
> recommending items to other items?
>

that would be item based recommenders

>
> Slope-one wouldn't have anything to do with item-item similarities, it
> works a bit differently. yes you could pre-compute similarities and
> use them with a custom ItemSimilarity implementation which reads from
> a DB table, and use that with GenericItemBasedRecommender.
>
> You could also do the similarity calculations with something like
> PearsonCorrelationSimilarity, and store that in the DB, and proceed
> with the above. Again, you'd have to write a little code but pretty
> easy.
>
> Or you could skip the DB altogether and let it compute item-item
> similarities on the fly.
>

I'll try your ideas and post the results. Thanks for all the help, Sean.


Re: Getting Taste to work on 10M dataset

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Jan 30, 2010 at 2:34 AM, Vinicius Carvalho
<vi...@gmail.com> wrote:
> I'm trying the 5.1.10 the latest one available at maven repositories,
> running it right now, since it takes a while, I'll inform of the results

OK but this would be something you can check in your table right now.
No columns should be nullable, or have nulls. If they do, that's the
problem.
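
On the MySQL side that is quick to verify (column names taken from the sample taste_preferences schema; adjust them if yours differ):

```sql
-- The Null column should read NO for all three fields:
SHOW COLUMNS FROM taste_preferences;

-- Should return 0:
SELECT COUNT(*) FROM taste_preferences
WHERE user_id IS NULL OR item_id IS NULL OR preference IS NULL;
```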


> At first I'm just creating the slopeonerecommender. did not even get to the
> actual code, all that time is used on the construction of the object

OK then it's the time spent in building diffs.


> You mean for the DiffStorage right? The datamodel would be good to be at
> JDBC right? I'm interested in item2item recommendations. I did this before

For both, 10M ratings isn't terribly big. I think you can get it into
memory in 2GB, plus the diffs, if you cap the number of diffs at some
reasonable value.

> using taste by hand by computing the SimilarityMatrix and storing it on DB.
> (I used as reference the book Collective Intelligence in action) and it
> worked fine. Just the Similarity Matrix took a while to be recalculated by
> it was a batch job running every hour. After that computing recomendations
> was just a breeze.

You mean you are interested in item-based recommenders, or
recommending items to other items?

Slope-one wouldn't have anything to do with item-item similarities; it
works a bit differently. Yes, you could pre-compute similarities and
use them with a custom ItemSimilarity implementation which reads from
a DB table, and use that with GenericItemBasedRecommender.
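
To make "works a bit differently" concrete, here is a toy, self-contained sketch of the average diffs slope-one precomputes and how a prediction falls out of them. The class and method names below are invented for illustration; this is not Mahout's implementation (MemoryDiffStorage.buildAverageDiffs does this pass for real data):

```java
import java.util.HashMap;
import java.util.Map;

// Toy slope-one sketch: for every ordered pair of items (a, b), store the
// average of (rating_a - rating_b) over users who rated both, then predict
// a user's rating for a target item from those diffs.
public class SlopeOneSketch {

  // prefs.get(user).get(item) = rating
  public static Map<String, Map<String, Double>> diffs(
      Map<String, Map<String, Double>> prefs) {
    Map<String, Map<String, Double>> sum = new HashMap<>();
    Map<String, Map<String, Integer>> count = new HashMap<>();
    for (Map<String, Double> ratings : prefs.values()) {
      for (Map.Entry<String, Double> a : ratings.entrySet()) {
        for (Map.Entry<String, Double> b : ratings.entrySet()) {
          if (a.getKey().equals(b.getKey())) {
            continue;
          }
          sum.computeIfAbsent(a.getKey(), k -> new HashMap<>())
              .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
          count.computeIfAbsent(a.getKey(), k -> new HashMap<>())
              .merge(b.getKey(), 1, Integer::sum);
        }
      }
    }
    // Turn sums into averages.
    for (Map.Entry<String, Map<String, Double>> row : sum.entrySet()) {
      row.getValue().replaceAll(
          (item, s) -> s / count.get(row.getKey()).get(item));
    }
    return sum;
  }

  // Unweighted prediction: average of (userRating_j + diff(target, j)).
  public static double predict(Map<String, Double> userRatings,
                               Map<String, Map<String, Double>> diff,
                               String target) {
    Map<String, Double> row = diff.get(target);
    double total = 0;
    int n = 0;
    for (Map.Entry<String, Double> e : userRatings.entrySet()) {
      if (row != null && row.containsKey(e.getKey())) {
        total += e.getValue() + row.get(e.getKey());
        n++;
      }
    }
    return n == 0 ? Double.NaN : total / n;
  }
}
```

On 10M ratings the real implementation makes this same full pass over the data up front, which is why all the time goes into constructing the recommender.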

You could also do the similarity calculations with something like
PearsonCorrelationSimilarity, store the results in the DB, and proceed
as above. Again, you'd have to write a little code, but it's pretty
easy.
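
The per-pair quantity PearsonCorrelationSimilarity computes is just the sample correlation over co-rated values; as a self-contained sketch (class name invented for illustration):

```java
// Minimal Pearson correlation over two equally-ordered rating arrays
// (the co-rated values for a pair of items).
public class PearsonSketch {

  public static double pearson(double[] x, double[] y) {
    int n = x.length;
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      sx += x[i];
      sy += y[i];
      sxx += x[i] * x[i];
      syy += y[i] * y[i];
      sxy += x[i] * y[i];
    }
    double num = sxy - sx * sy / n;
    double den = Math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n));
    return den == 0.0 ? 0.0 : num / den; // treat zero variance as no similarity
  }
}
```

You could batch-run this over item pairs, store (item_a, item_b, similarity) rows in a table, and have a custom ItemSimilarity read from that table.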

Or you could skip the DB altogether and let it compute item-item
similarities on the fly.

Re: Getting Taste to work on 10M dataset

Posted by Vinicius Carvalho <vi...@gmail.com>.
Thanks for the quick reply :)

On Sat, Jan 30, 2010 at 12:15 AM, Sean Owen <sr...@gmail.com> wrote:

> On Sat, Jan 30, 2010 at 2:01 AM, Vinicius Carvalho
> <vi...@gmail.com> wrote:
> > Hello there! I'm trying to get Taste to work on the 10M dataset but, even
> > following some tips from Sean on Mahout in Action, I can't get it working
> > using SlopeOneRecommender and JDBC.
>
> You are referring to the GroupLens data set right?
>
Yep, that's right.

>
> > Caused by: java.lang.NullPointerException
> >    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2843)
> >    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2830)
> >    at
> >
> org.apache.commons.dbcp.DelegatingResultSet.getLong(DelegatingResultSet.java:190)
> >    at
> >
> org.apache.mahout.cf.taste.impl.model.jdbc.AbstractJDBCDataModel.getLongColumn(AbstractJDBCDataModel.java:602)
>
> That's odd, it's an error from the MySQL driver. I am taking a guess
> that one of the columns in the table had a null value? everything
> should be non-NULL.
>
> Or else... not sure, sounds like a driver bug? try the latest version
> 5.1.11.
>

I'm trying 5.1.10, the latest one available in the Maven repositories. I'm
running it right now; since it takes a while, I'll report the results.


> > I tried to replace the MemoryDiff by a JDBCDiff. After almost 1 hour
> running
> > the code, and 100% CPU usage by my mysql process. I decided to quit, is
> it
> > supposed to take so long?
>
> Depends on what you're doing -- what job are you running, computing
> recs for all users?
> Is your table indexed properly? that makes a huge difference.
>

At first I'm just creating the SlopeOneRecommender. I did not even get to the
actual code; all that time is spent on the construction of the object.


> Why switch to JDBC? memory is a lot faster and seems like you have a
> lot of heap.
> That makes the driver issue a moot point.
>

You mean for the DiffStorage, right? The DataModel would be good to keep in
JDBC, right? I'm interested in item2item recommendations. I did this before
using Taste by hand, computing the SimilarityMatrix and storing it in the DB
(I used the book Collective Intelligence in Action as a reference) and it
worked fine. The Similarity Matrix took a while to be recalculated, but it
was a batch job running every hour. After that, computing recommendations
was just a breeze.

What I'd like to achieve is the same thing I was doing by hand, but using
Taste :)




>  >
> > I tried to change the maxEntries to a smaller value but the NullPointer
> > always happens. My table is not exactly like the one used as sample on
> the
> > source code, but I do inform the correct columns and they are of the same
> > type.
>




Re: Getting Taste to work on 10M dataset

Posted by Sean Owen <sr...@gmail.com>.
On Sat, Jan 30, 2010 at 2:01 AM, Vinicius Carvalho
<vi...@gmail.com> wrote:
> Hello there! I'm trying to get Taste to work on the 10M dataset but, even
> following some tips from Sean on Mahout in Action, I can't get it working
> using SlopeOneRecommender and JDBC.

You are referring to the GroupLens data set right?


> Caused by: java.lang.NullPointerException
>    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2843)
>    at com.mysql.jdbc.ResultSetImpl.getLong(ResultSetImpl.java:2830)
>    at
> org.apache.commons.dbcp.DelegatingResultSet.getLong(DelegatingResultSet.java:190)
>    at
> org.apache.mahout.cf.taste.impl.model.jdbc.AbstractJDBCDataModel.getLongColumn(AbstractJDBCDataModel.java:602)

That's odd, it's an error from the MySQL driver. I am taking a guess
that one of the columns in the table has a null value? Everything
should be non-NULL.

Or else... not sure, sounds like a driver bug? Try the latest version, 5.1.11.


> I tried to replace the MemoryDiff by a JDBCDiff. After almost 1 hour running
> the code, and 100% CPU usage by my mysql process. I decided to quit, is it
> supposed to take so long?

Depends on what you're doing -- what job are you running, computing
recs for all users?
Is your table indexed properly? That makes a huge difference.

Why switch to JDBC? Memory is a lot faster, and it seems like you have a
lot of heap.
That makes the driver issue a moot point.

>
> I tried to change the maxEntries to a smaller value but the NullPointer
> always happens. My table is not exactly like the one used as sample on the
> source code, but I do inform the correct columns and they are of the same
> type.