Posted to dev@mahout.apache.org by Sisir Koppaka <si...@gmail.com> on 2010/04/04 10:29:38 UTC

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Thanks Robin, Ted, Jake and Sean for your feedback. I've refined my
proposal, added a milestone timeline with design details, and submitted it
at the GSoC site. The title of the proposal is *Restricted Boltzmann
Machines on the Netflix Dataset*. Please do give me your feedback on the
proposal, located here:
<https://docs.google.com/fileview?id=0B-jUrudTSg7-ZTg3YTU5YTktZDBhZC00NWFiLTk4MTQtNzVlODZhOWEzYTU0&hl=en>

I have a couple of queries that'd help me further refine my proposal.
Firstly, I am expecting to reuse the code at
*org.apache.mahout.cf.taste.example.netflix* and have mentioned so in my
proposal. Please let me know if this is OK, or if you foresee any problems
doing this. Secondly, I will implement an HBase-based datastore as well as
an InMemory-based one, but is the InMemory-based one a prerequisite for the
HBase-based one to be used? (Eventually everything has to go to memory, so
is this being done elsewhere or does the InMemory datastore do it?)

Thanking you,
Best regards,
Sisir Koppaka

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sisir Koppaka <si...@gmail.com>.
I have put up the processed Netflix dataset here:
<https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv>
This file does not contain dates and is 1.5GB in size. A torrent is also
available:
<https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv?torrent>

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sisir Koppaka <si...@gmail.com>.
On Sun, Apr 4, 2010 at 4:10 PM, Sean Owen <sr...@gmail.com> wrote:

> I think you want to write this to accept "generic" data, and not
> necessarily assume the Netflix input format. I suggest you accept CSV
> data, in the form "userID,itemID,value", since that is what all the
> recommenders do.
>
Sure, I'll write it for "userID, movieID, rating". Netflix also provides
dates, but we can ignore those for the time being.


> You may need a quick utility program to convert the Netflix data format to
> this. This wouldn't be part of the project, or else we can put it in
> utils later.
>
I have done this already. I have a 1.2GB CSV file containing all 100
million records in the Netflix dataset as "userID, movieID, rating, date".
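
For what it's worth, the conversion is essentially the following (a rough
sketch of the idea, not the exact code; the directory and output paths are
illustrative). Each training_set file starts with a "movieID:" header line,
followed by "customerID,rating,date" lines:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Rough sketch: flatten the per-movie Netflix files into one
// "userID,movieID,rating,date" CSV.
public final class NetflixToCsv {
  public static void main(String[] args) throws IOException {
    File trainingSetDir = new File(args[0]);    // e.g. training_set/
    PrintWriter out = new PrintWriter(args[1]); // e.g. netflix-dataset.csv
    try {
      for (File movieFile : trainingSetDir.listFiles()) {
        BufferedReader reader = new BufferedReader(new FileReader(movieFile));
        try {
          // first line of each file is "movieID:"
          String movieID = reader.readLine().replace(":", "").trim();
          String line;
          while ((line = reader.readLine()) != null) {
            String[] tokens = line.split(","); // customerID,rating,date
            out.println(tokens[0] + ',' + movieID + ',' + tokens[1] + ',' + tokens[2]);
          }
        } finally {
          reader.close();
        }
      }
    } finally {
      out.close();
    }
  }
}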


-- 
SK

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sean Owen <sr...@gmail.com>.
I think you want to write this to accept "generic" data, and not
necessarily assume the Netflix input format. I suggest you accept CSV
data, in the form "userID,itemID,value", since that is what all the
recommenders do.
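
Something like this minimal sketch would do for reading that format (plain
JDK only; the class name here is just illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Minimal sketch: stream "userID,itemID,value" lines and parse each field.
public final class GenericPreferenceReader {
  public static void main(String[] args) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] tokens = line.split(",");
        long userID = Long.parseLong(tokens[0].trim());
        long itemID = Long.parseLong(tokens[1].trim());
        float value = Float.parseFloat(tokens[2].trim());
        // hand (userID, itemID, value) to whatever consumes preferences
        System.out.printf("%d %d %.1f%n", userID, itemID, value);
      }
    } finally {
      reader.close();
    }
  }
}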

You may need a quick utility program to convert the Netflix data format to
this. This wouldn't be part of the project, or else we can put it in
utils later.

I don't think data storage is an issue here -- the files will live on
HDFS/S3, that's it. No code is needed. I don't think this has anything
to do with the classifier data stores unless I misunderstand the
project.

On Sun, Apr 4, 2010 at 11:17 AM, Sisir Koppaka <si...@gmail.com> wrote:
> Thanks, this is what I wanted to know. So, now, there would be a separate
> example that reads in the Netflix dataset in a distributed way, that would
> utilize the RBM implementation. Would that be right?
>
> The datastore I was referring to in the proposal was based on
> mahout.classifier.bayes.datastore. I understand the HBase, Cassandra and
> other adapters are being refactored out in a separate ticket, so I'll just
> stick with HDFS and S3.
>
> If there's anything else that I would need to add in the proposal, do let me
> know.

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sisir Koppaka <si...@gmail.com>.
Correction: "that would be utilize" should have read "that would
utilize" - sorry!

I'll start off by implementing the distributed Netflix read-in, if that's
OK by you.
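
As a starting point, I'm picturing a mapper along these lines (a first
sketch only; the class name and key/value choices are illustrative, not
final):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: read "userID,movieID,rating" CSV lines from HDFS/S3 and key each
// rating by user, so a reducer can assemble per-user rating vectors for
// RBM training.
public class NetflixRatingsMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] tokens = line.toString().split(",");
    if (tokens.length < 3) {
      return; // skip malformed lines
    }
    long userID = Long.parseLong(tokens[0].trim());
    context.write(new LongWritable(userID),
        new Text(tokens[1].trim() + ',' + tokens[2].trim()));
  }
}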

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sisir Koppaka <si...@gmail.com>.
Thanks, this is what I wanted to know. So, now, there would be a separate
example that reads in the Netflix dataset in a distributed way, that would
utilize the RBM implementation. Would that be right?

The datastore I was referring to in the proposal was based on
mahout.classifier.bayes.datastore. I understand the HBase, Cassandra and
other adapters are being refactored out in a separate ticket, so I'll just
stick with HDFS and S3.

If there's anything else that I would need to add in the proposal, do let me
know.

On Sun, Apr 4, 2010 at 3:09 PM, Sean Owen <sr...@gmail.com> wrote:

> Reusing code is fine, in principle. The code you mention, however,
> will not help you much. It is non-distributed and has nothing to do
> with Hadoop. You might reuse a bit of code to parse the input files,
> that's about it.
>
> Which data store are you referring to... if I understand right, you
> are implementing an algorithm on Hadoop. You would definitely not
> implement anything to load into memory, and I think you want to work
> with HDFS and Amazon S3, not HBase.
>
> On Sun, Apr 4, 2010 at 9:29 AM, Sisir Koppaka <si...@gmail.com>
> wrote:
> > Firstly, I am expecting to reuse the code at
> > *org.apache.mahout.cf.taste.example.netflix* and have mentioned so in my
> > proposal. Please let me know if this is OK, or if you foresee any
> > problems doing this. Secondly, I will implement an HBase-based datastore
> > as well as an InMemory-based one, but is the InMemory-based one a
> > prerequisite for the HBase-based one to be used? (Eventually everything
> > has to go to memory, so is this being done elsewhere or does the
> > InMemory datastore do it?)
>



-- 
SK

Re: Reg. Netflix Prize Apache Mahout GSoC Application

Posted by Sean Owen <sr...@gmail.com>.
Reusing code is fine, in principle. The code you mention, however,
will not help you much. It is non-distributed and has nothing to do
with Hadoop. You might reuse a bit of code to parse the input files,
that's about it.

Which data store are you referring to... if I understand right, you
are implementing an algorithm on Hadoop. You would definitely not
implement anything to load into memory, and I think you want to work
with HDFS and Amazon S3, not HBase.

On Sun, Apr 4, 2010 at 9:29 AM, Sisir Koppaka <si...@gmail.com> wrote:
> Firstly, I am expecting to reuse the code at
> *org.apache.mahout.cf.taste.example.netflix* and have mentioned so in my
> proposal. Please let me know if this is OK, or if you foresee any problems
> doing this. Secondly, I will implement an HBase-based datastore as well as
> an InMemory-based one, but is the InMemory-based one a prerequisite for
> the HBase-based one to be used? (Eventually everything has to go to
> memory, so is this being done elsewhere or does the InMemory datastore
> do it?)