Posted to user@mahout.apache.org by Tan Shern Shiou <sh...@mnc.com.my> on 2011/11/26 18:13:49 UTC

Using Brisk with Mahout

Hello,

I am planning to use Mahout with Hadoop and Cassandra as the datastore. I
have been reading about the benefits of using Brisk. Can we use Mahout
with Brisk out of the box, the same way we implement it with Hadoop and
Cassandra?

What's the difference between using Brisk and a combination of Hadoop and
Cassandra?

Thanks.

Re: Using Brisk with Mahout

Posted by Sean Owen <sr...@gmail.com>.
You can stick CassandraDataModel into your non-distributed recommender,
yes. It is still going to cache the Cassandra data in memory -- even
reading out of a fast Cassandra cluster is too slow for this kind of
intense access pattern -- but yes it will read just fine.

On Tue, Nov 29, 2011 at 2:37 PM, Tan Shern Shiou
<sh...@mnc.com.my> wrote:

> Thanks for the advice. I will look into it.
> Another question: taste-web can support Cassandra without a major rewrite,
> right?
>
>
> On 29/11/2011 10:32 PM, Sean Owen wrote:
>
>> like your input is actually not
>> what you think it is, or something else you're doing is consuming a great
>> deal of memory. I would debug and/or profile to see where the memory is
>> used. For example you've not shown where the OutOfMemoryError is thrown.
>>
>> Sean
>>
>

Re: Using Brisk with Mahout

Posted by Tan Shern Shiou <sh...@mnc.com.my>.
Thanks for the advice. I will look into it.
Another question: taste-web can support Cassandra without a major rewrite,
right?

On 29/11/2011 10:32 PM, Sean Owen wrote:
> like your input is actually not
> what you think it is, or something else you're doing is consuming a great
> deal of memory. I would debug and/or profile to see where the memory is
> used. For example you've not shown where the OutOfMemoryError is thrown.
>
> Sean

Re: Using Brisk with Mahout

Posted by Sean Owen <sr...@gmail.com>.
In general, you have to increase a JVM's heap size if you're running
anything that needs non-trivial memory. I think the default heap size is
still 32M or 64M, which is quite small for these purposes. So I am not
surprised if you must increase the heap size, in general.
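For the Maven-launched taste-web setup discussed in this thread, one common way to raise the JVM heap is through the MAVEN_OPTS environment variable; the 2048m value below is illustrative, echoing the figure used elsewhere in the thread:

```shell
# Raise the maximum heap for the JVM that Maven launches (value illustrative).
export MAVEN_OPTS="-Xmx2048m"
```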

It is still surprising to me that the 1M data set would work with a default
heap size but your much smaller data set would not. Does your data set
contain a very large number of distinct item IDs? The slope-one diff storage
scales with the square of the number of items. But I doubt that is the case,
and the size of the diff storage is capped anyway.
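To illustrate the quadratic growth described here, a quick arithmetic sketch (pure back-of-envelope counting, not Mahout's actual data structures):

```python
def slope_one_pairs(num_items):
    """Worst-case number of item-item diffs slope-one could track: n*(n-1)/2."""
    return num_items * (num_items - 1) // 2

# Growing the item count 10x grows the pair count roughly 100x:
print(slope_one_pairs(1_000))   # -> 499500
print(slope_one_pairs(10_000))  # -> 49995000
```

This quadratic blow-up is why capping the diff storage matters.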

I think there's something else wrong here, like your input is actually not
what you think it is, or something else you're doing is consuming a great
deal of memory. I would debug and/or profile to see where the memory is
used. For example you've not shown where the OutOfMemoryError is thrown.

Sean

On Tue, Nov 29, 2011 at 2:21 PM, Tan Shern Shiou
<sh...@mnc.com.my> wrote:

> I am sorry if I didn't make myself clear.
> I have two data sets here:
> 1. GroupLens 1M
> 2. My own, with 10,000 ratings (a very small test set)
>
> The taste-web app runs fine with the default heap size. But when I load my
> own data set (10,000+ ratings), it crashes after a slope-one recommendation.
> I need to change to a 2048M heap size for it to run.
>
> It doesn't make sense. What could be the problem? I am afraid that if I
> scale this to production, it may not run. :(
>
>
> On 29/11/2011 10:15 PM, Sean Owen wrote:
>
>> This doesn't make sense -- it runs with 1M ratings, but not with 10,000?
>> Are you sure you have your numbers right? 10,000 is a tiny data set.
>>
>> Heap size should not be 2048, but something like 2048M.
>>
>> On Tue, Nov 29, 2011 at 2:03 PM, Tan Shern Shiou
>> <sh...@mnc.com.my> wrote:
>>
>>> Currently I am running into some problems. My taste-web app runs fine
>>> with GroupLens 1M without changing the MAVEN_OPTS heap size.
>>>
>>> However, my own data set has only 1,200+ users with 10,000+ ratings, and
>>> it can only run if I change the heap size to 2048. Any suggestions for
>>> solving this when I need to scale up my app? Thanks.
>>>
>>>
>>>
>

Re: Using Brisk with Mahout

Posted by Tan Shern Shiou <sh...@mnc.com.my>.
I am sorry if I didn't make myself clear.
I have two data sets here:
1. GroupLens 1M
2. My own, with 10,000 ratings (a very small test set)

The taste-web app runs fine with the default heap size. But when I load
my own data set (10,000+ ratings), it crashes after a slope-one
recommendation. I need to change to a 2048M heap size for it to run.

It doesn't make sense. What could be the problem? I am afraid that if I
scale this to production, it may not run. :(

On 29/11/2011 10:15 PM, Sean Owen wrote:
> This doesn't make sense -- it runs with 1M ratings, but not with 10,000?
> Are you sure you have your numbers right? 10,000 is a tiny data set.
>
> Heap size should not be 2048, but something like 2048M.
>
> On Tue, Nov 29, 2011 at 2:03 PM, Tan Shern Shiou
> <sh...@mnc.com.my> wrote:
>
>> Currently I am running into some problems. My taste-web app runs fine
>> with GroupLens 1M without changing the MAVEN_OPTS heap size.
>>
>> However, my own data set has only 1,200+ users with 10,000+ ratings, and
>> it can only run if I change the heap size to 2048. Any suggestions for
>> solving this when I need to scale up my app? Thanks.
>>
>>



Re: Using Brisk with Mahout

Posted by Sean Owen <sr...@gmail.com>.
This doesn't make sense -- it runs with 1M ratings, but not with 10,000?
Are you sure you have your numbers right? 10,000 is a tiny data set.

Heap size should not be 2048, but something like 2048M.

On Tue, Nov 29, 2011 at 2:03 PM, Tan Shern Shiou
<sh...@mnc.com.my> wrote:

> Currently I am running into some problems. My taste-web app runs fine
> with GroupLens 1M without changing the MAVEN_OPTS heap size.
>
> However, my own data set has only 1,200+ users with 10,000+ ratings, and
> it can only run if I change the heap size to 2048. Any suggestions for
> solving this when I need to scale up my app? Thanks.
>
>

Re: Using Brisk with Mahout

Posted by Tan Shern Shiou <sh...@mnc.com.my>.
Currently I am running into some problems. My taste-web app runs fine with
GroupLens 1M without changing the MAVEN_OPTS heap size.

However, my own data set has only 1,200+ users with 10,000+ ratings, and it
can only run if I change the heap size to 2048. Any suggestions for solving
this when I need to scale up my app? Thanks.

On 27/11/2011 1:35 AM, Sean Owen wrote:
> Moving to Hadoop means using entirely different code, and it's nothing
> to do with taste-web, no.
>
> How much data do you have? You ought to be able to get away with not
> using Hadoop to many tens of millions of ratings, depending on your
> use case. I'd only bother if you can't avoid it any other way.
>
> On Sat, Nov 26, 2011 at 5:31 PM, Tan Shern Shiou
> <sh...@mnc.com.my>  wrote:
>> Thanks a lot, Sean.
>>
>> Yes, I am trying to choose the right technology as my recommendation
>> engine scales up. It is not working well with my FileDataModel now. If
>> using Hadoop alone can help, then I will go with the most natural approach.
>>
>> Another question: can I use taste-web along with Hadoop only?
>>
>


-- 
Best Regards,
Shern Shiou Tan
Software Engineer, Music Streaming
Tel: +603 7955 9448 ext 716

Re: Using Brisk with Mahout

Posted by Sean Owen <sr...@gmail.com>.
Moving to Hadoop means using entirely different code, and it's nothing
to do with taste-web, no.

How much data do you have? You ought to be able to get away without
using Hadoop up to many tens of millions of ratings, depending on your
use case. I'd only bother if you can't avoid it any other way.
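As a very rough illustration of that sizing claim, here is a back-of-envelope estimate; the per-rating byte cost is an assumption for illustration, not Mahout's measured overhead:

```python
def estimated_heap_gb(num_ratings, bytes_per_rating=28):
    """Back-of-envelope in-memory footprint for a ratings data set, in GiB."""
    return num_ratings * bytes_per_rating / (1024 ** 3)

# At ~28 bytes per rating, 50 million ratings is on the order of 1.3 GiB,
# which fits comfortably in a single JVM heap.
print(round(estimated_heap_gb(50_000_000), 2))  # -> 1.3
```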

On Sat, Nov 26, 2011 at 5:31 PM, Tan Shern Shiou
<sh...@mnc.com.my> wrote:
> Thanks a lot, Sean.
>
> Yes, I am trying to choose the right technology as my recommendation
> engine scales up. It is not working well with my FileDataModel now. If
> using Hadoop alone can help, then I will go with the most natural approach.
>
> Another question: can I use taste-web along with Hadoop only?
>

Re: Using Brisk with Mahout

Posted by Tan Shern Shiou <sh...@mnc.com.my>.
Thanks a lot, Sean.

Yes, I am trying to choose the right technology as my recommendation
engine scales up. It is not working well with my FileDataModel now. If
using Hadoop alone can help, then I will go with the most natural
approach.

Another question: can I use taste-web along with Hadoop only?

Thanks.

On 27/11/2011 1:25 AM, Sean Owen wrote:
> I haven't used Brisk, but doubt there would be any real difference for
> your use case. Hadoop's integration with Cassandra is fairly light and
> arms-length, so the details of the Cassandra distro behind it ought
> not matter too much.
>
> Cassandra isn't the most natural choice as a Hadoop data store, but it
> can be made to work. If you're choosing tools from scratch, just using
> HDFS as your data store is probably more natural.
>
> Mahout has virtually no direct relationship to Cassandra, so the same
> comments apply even more as regards Brisk vs vanilla Cassandra and
> Mahout. It should not matter much if at all.
>
> The only direct integration with Cassandra is in the non-distributed
> Recommender, where I cobbled together a CassandraDataModel for an
> article I wrote
> (http://www.acunu.com/blogs/sean-owen/recommending-cassandra/).
>
> Mahout doesn't use Cassandra directly when using Hadoop, but you can
> modify it to work with Cassandra as an InputFormat. Again,
> conveniently, I wrote about that recently for Acunu:
> http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/
>
> Sean
>
> On Sat, Nov 26, 2011 at 5:13 PM, Tan Shern Shiou
> <sh...@mnc.com.my>  wrote:
>> Hello,
>>
>> I am planning to use Mahout with Hadoop and Cassandra as the datastore. I
>> have been reading about the benefits of using Brisk. Can we use Mahout with
>> Brisk out of the box, the same way we implement it with Hadoop and Cassandra?
>>
>> What's the difference between using Brisk and a combination of Hadoop and
>> Cassandra?
>>
>> Thanks.
>>
>



Re: Using Brisk with Mahout

Posted by Sean Owen <sr...@gmail.com>.
I haven't used Brisk, but I doubt there would be any real difference for
your use case. Hadoop's integration with Cassandra is fairly light and
arm's-length, so the details of the Cassandra distribution behind it ought
not to matter too much.

Cassandra isn't the most natural choice as a Hadoop data store, but it
can be made to work. If you're choosing tools from scratch, just using
HDFS as your data store is probably more natural.

Mahout has virtually no direct relationship to Cassandra, so the same
comments apply even more as regards Brisk vs vanilla Cassandra and
Mahout. It should not matter much if at all.

The only direct integration with Cassandra is in the non-distributed
Recommender, where I cobbled together a CassandraDataModel for an
article I wrote
(http://www.acunu.com/blogs/sean-owen/recommending-cassandra/).

Mahout doesn't use Cassandra directly when using Hadoop, but you can
modify it to work with Cassandra as an InputFormat. Again,
conveniently, I wrote about that recently for Acunu:
http://www.acunu.com/blogs/sean-owen/scaling-cassandra-and-mahout-hadoop/

Sean

On Sat, Nov 26, 2011 at 5:13 PM, Tan Shern Shiou
<sh...@mnc.com.my> wrote:
> Hello,
>
> I am planning to use Mahout with Hadoop and Cassandra as the datastore. I
> have been reading about the benefits of using Brisk. Can we use Mahout with
> Brisk out of the box, the same way we implement it with Hadoop and Cassandra?
>
> What's the difference between using Brisk and a combination of Hadoop and
> Cassandra?
>
> Thanks.
>