Posted to dev@mahout.apache.org by Jake Mannix <ja...@gmail.com> on 2010/06/01 01:59:55 UTC

Re: [GSOC] Matrix Operations on HDFS

Hey Sisir,

  Some questions, to hopefully spark some of your thinking:

If you try to step into the mindset of map/reduce steps, can you outline
what the algorithm is doing during the steps a) updating the input weights,
b) updating the hidden weights, and c) contrastive divergence?

If you put all of the ratings of a particular user into a Vector (which
structurally looks like a Map<Integer,Double>, but is more efficient and
has lots of mathematical methods available on it), with the keys
being the movieIds the user has rated, and the values being their ratings,
then your entire data set could live on HDFS as a DistributedRowMatrix.
Similarly, if you take the transpose() of that matrix, you get back
an HDFS file whose rows are Vector instances of the ratings for
a given movie, with the entries keyed on the userId who rated them.
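
For concreteness, here's a minimal sketch of what building one such row
might look like (the numbers are made up, and I'm writing the
org.apache.mahout.math calls from memory):

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.SequentialAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class UserRowExample {
    public static void main(String[] args) {
      int numItems = 17770;  // hypothetical, Netflix-sized item space
      // one user's ratings, keyed by movieId
      Vector userRow = new RandomAccessSparseVector(numItems);
      userRow.set(3, 4.0);   // movie 3, rating 4
      userRow.set(17, 5.0);  // movie 17, rating 5
      // the sequential-access form is what you'd write out as one row of
      // the DistributedRowMatrix (one VectorWritable per user)
      Vector row = new SequentialAccessSparseVector(userRow);
      System.out.println(row);
    }
  }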

The algorithm for training in the Hinton paper you linked to seems
to allow for a pretty nice "per-user" parallelization (reading section
2.2, in particular) - each user gets "their own" separate RBM,
with a constraint that after every iteration, you need to average
together all of the connection weights (and biases) across all
of the users, and use the result to initialize the next iteration.

In particular, it looks like what you do is distribute the connection
coefficient matrix (well, matrices, because for the discrete
ratings case that you and Hinton are considering, you actually
have one set of weights per rating, 1,...,5) via DistributedCache
to all of the Hadoop nodes.  This is mostly fine: the size of this
data is 5 * numItems (let's not get too caught up in them being
movies ;) ) * numHiddenNodes * 8 bytes, so mappers would
be able to fit this all in memory for 100 hidden units up to about
100K movies even for small-memory-footprint Mappers, and you could
probably scale up to 10x that if you use up the memory allowable
in the Mapper JVMs.  To scale further, we would need to do
further parallelization.

But ok, so you distribute the "current" set of biases and
connections to all of your mappers, and load them into memory
at the start of everything.  The Mappers then iterate over one
user at a time (getting all of their ratings at once), and
you can iterate over each of their ratings, updating a copy of
a portion of the connections and biases matrix (just the part
which corresponds to the items the user has rated) using
CD, and then you *write the updated values to HDFS*.  This
is where some cleverness should probably come in: you want
to write out key/value pairs, keyed so that when the same key
gets multiple values, you can do the averaging easily in
parallel too.  Keying on the index of the hidden units might
work: send any updated weights and biases for hidden unit "j"
out as (j, deltaW_j), where deltaW_j is really a matrix of
values deltaW_ijk, the deltas for the connections between
hidden unit j and rating k of item i.
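
Roughly, a sketch of what that Mapper could look like (the class names
here are made up, the DistributedCache loading is omitted, and the CD
update itself is hand-waved away):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.DenseMatrix;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.MatrixWritable;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical: input is one row of the user x item rating matrix
  // (userId -> sparse Vector of ratings); output is keyed on hidden unit j.
  public class RBMTrainingMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, MatrixWritable> {

    // weights[j] is the numItems x 5 slice for hidden unit j, loaded from
    // the DistributedCache in setup() (loading code omitted)
    private Matrix[] weights;
    private int numHiddenUnits;

    @Override
    protected void map(IntWritable userId, VectorWritable ratings, Context ctx)
        throws IOException, InterruptedException {
      Vector v = ratings.get();
      for (int j = 0; j < numHiddenUnits; j++) {
        // one CD step for this user against hidden unit j's weights; only
        // entries for items this user actually rated end up nonzero
        Matrix deltaWj = contrastiveDivergenceUpdate(v, weights[j]);
        ctx.write(new IntWritable(j), new MatrixWritable(deltaWj));
      }
    }

    // placeholder for the actual CD-1 update
    private Matrix contrastiveDivergenceUpdate(Vector v, Matrix wj) {
      return new DenseMatrix(wj.numRows(), wj.numCols());
    }
  }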

The reduce step would then be really just a sum and
average, and after that map/reduce pass you then need to
redistribute the new W matrix out to all the nodes again
for the next iteration.
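
A sketch of the matching Reducer (again, the names are made up, and the
exact Writable you'd use for the deltas is up for grabs):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.MatrixWritable;

  // Hypothetical: averages all of the deltaW_j updates emitted for hidden
  // unit j; the averaged updates then get applied to W and redistributed
  // for the next iteration.
  public class RBMWeightAveragingReducer
      extends Reducer<IntWritable, MatrixWritable, IntWritable, MatrixWritable> {

    @Override
    protected void reduce(IntWritable hiddenUnit, Iterable<MatrixWritable> deltas,
        Context ctx) throws IOException, InterruptedException {
      Matrix sum = null;
      int count = 0;
      for (MatrixWritable delta : deltas) {
        Matrix m = delta.get();
        // copy on the first pass, since Hadoop reuses the Writable instance
        sum = (sum == null) ? m.like().plus(m) : sum.plus(m);
        count++;
      }
      if (sum != null) {
        ctx.write(hiddenUnit, new MatrixWritable(sum.divide(count)));
      }
    }
  }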

I'm not an RBM expert, so I'm not sure I have the way the
training on these things works exactly right, but it certainly
seems like the operation looks a lot like what Ted was saying:
a matrix multiplication of the observed rating matrix times the
weights matrix (plus the bias, and then the logistic function
applied to each of the values in the output).

It's a little trickier than normal matrix multiplication,
because the matrix in question has this extra index
of "which rating" it rated, but that could get folded into
the inner map function.  Essentially you could do a
map-side join of the visible ratings matrix and the
weights matrix, joined on the key they share in
common: movies (items), and do (for each movie i,
hidden unit j, and rating k) :

X_ij = Sum_{k=1..5} (v_ik * W_ijk)

then emit key/value pairs as (j, X_ij).

The reducer then sums up all of the different X_ij
for different movies i, adds on the bias b_j (which will
have to have been loaded into memory), applies the
logistic function, and you've got the current guess as to
p(h_j = 1 | V).
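
Roughly, the per-movie map-side piece and the reduce-side logistic look
like this (ignoring the Hadoop plumbing; the method and variable names
are made up for illustration):

  import org.apache.mahout.math.DenseVector;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.Vector;

  public class HiddenUnitActivation {

    // Map side, for one movie i: vi is the length-5 indicator of which
    // rating the user gave movie i, and Wi is the numHiddenUnits x 5 slice
    // of the weights for movie i.  Returns X_ij for each j, which would be
    // emitted as (j, X_ij) pairs.
    static Vector partialHiddenInput(Vector vi, Matrix Wi) {
      Vector x = new DenseVector(Wi.numRows());
      for (int j = 0; j < Wi.numRows(); j++) {
        double sum = 0.0;
        for (int k = 0; k < 5; k++) {
          sum += vi.get(k) * Wi.get(j, k);   // X_ij = Sum_k v_ik * W_ijk
        }
        x.set(j, sum);
      }
      return x;
    }

    // Reduce side, for hidden unit j: total is Sum_i X_ij, bj is the bias.
    static double hiddenUnitProbability(double total, double bj) {
      return 1.0 / (1.0 + Math.exp(-(total + bj)));  // p(h_j = 1 | V)
    }
  }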

This technique will require more map-reduce steps
than the one above, but will scale better (no memory
limitations).

So I'm not sure if any of this was helpful, but I'm just
trying to brainstorm along with you on ways you can
distribute this problem such that you're not stuck
calculating how much fits in memory.

Let me know which parts of this make sense, and which
parts are the ramblings of someone who doesn't
understand that much about neural nets in general, let
alone RBM's (but I'd like to learn more!).

  -jake



On Mon, May 31, 2010 at 2:32 PM, Sisir Koppaka <si...@gmail.com> wrote:

> On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > > On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka
> > > <sisir.koppaka@gmail.com> wrote:
> >
> >
> > > The general access pattern to the dataset within any algo would be to
> > > find the lists of (users, ratings) for each movie and to find the lists
> > > of (movies, ratings) for each user in a fast way. In C++, this design
> > > requirement can be satisfied using an in-memory approach.
> > >
> >
> > This pattern of access is essentially the same as a matrix multiplication.
> > If A is the user x movie matrix, then your computation has the same
> > computational shape as the multiplication A' A.
> >
> >
> This was insightful. I wanted to get out of my current way of thinking
> about the problem. Thanks.
>
>
> > > One solution is we first design a bit access pattern such that we fit
> > > (user, rating) or (movie, rating) in 32 bits (actually even the date
> > > can fit, but that depends on whether we have a temporal model to use
> > > it :) ).
> >
> >
> > I don't think that this is a good design decision at all.  Much better to
> > keep your structures general.  Many large applications will have more than
> > 20,000 items.  To use Veoh as an example again, we had tens of millions of
> > users and about 10 million videos.
>
>
> I will definitely change that. I was trying to explain why I used userent[]
> in the code...which was because I was still thinking along those lines.
> Course correction for sure. I was trying to find possible Mahout data
> structures here that would be useful in replacing the current non-generic
> format.
>
>
>
> > > So the first one would be like:
> > >
> > > User 1:
> > > Movie 3, Rating 4
> > > Movie 17, Rating 5
> >
> >
> > This is just a sparse matrix.  You should use a pre-designed data
> > structure for this.
> >
>
> Ok...
>
>
> >
>
> > > and vice versa. With dates if necessary.
> > >
> >
> > Adding dates is easily done with a second matrix.  Make sure you notice
> > that there are two kinds of sparse matrix.  You want the sequential
> > access kind.  You might look to see if you can reduce your storage
> > requirements by storing the equivalent of A'A instead of A' and A as you
> > are suggesting.
> >
> >
> This was also very useful, thanks.
>
>
> >
> > > A few examples of the operations -
> > >
> > > for (i = 0; i < softmax; i++) {
> > >   visbiases[j][i] =
> > >       Math.log(((double) moviercount[j * softmax + i]) / ((double) mtot));
> > > }
> > >
> > > Here's another example -
> > > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h]
> > >     + EpsilonW * ((CDp - CDn) - weightCost * vishid[m][rr][h]);
> > >
> >     c = new DenseMatrix(a.rows(), b.columns());
> >     for (int i = 0; i < a.rows(); i++) {
> >        for (int j = 0; j < b.columns(); j++) {
> >           double sum = 0;
> >           // inner dimension: a.columns() must equal b.rows()
> >           for (int k = 0; k < a.columns(); k++) {
> >               sum += a.get(i, k) * b.get(k, j);
> >           }
> >           c.put(i, j, sum);
> >       }
> >    }
> >
> > Since memory is much slower than the arithmetic in modern processors, this
> > creates a memory bottleneck.  What you need to do, then, is to re-use
> > elements many times to avoid reading them from main memory.  To do this,
> > you need to do a block decomposition of the matrix multiply, and you may
> > need to do two levels of block decomposition in order to reflect the L1
> > cache and the number of available registers.
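
(For what it's worth, one level of the blocking Ted is describing looks
roughly like the sketch below; the block size is something you'd tune to
the cache, and the plain arrays are just to keep the access pattern
obvious.)

  public class BlockedMultiply {
    // one level of cache blocking for c += a * b
    static void multiply(double[][] a, double[][] b, double[][] c, int block) {
      int n = a.length;        // rows of a
      int p = b.length;        // columns of a == rows of b
      int m = b[0].length;     // columns of b
      for (int ii = 0; ii < n; ii += block) {
        for (int kk = 0; kk < p; kk += block) {
          for (int jj = 0; jj < m; jj += block) {
            // each a[i][k] and b[k][j] loaded here is reused ~block times
            for (int i = ii; i < Math.min(ii + block, n); i++) {
              for (int k = kk; k < Math.min(kk + block, p); k++) {
                double aik = a[i][k];
                for (int j = jj; j < Math.min(jj + block, m); j++) {
                  c[i][j] += aik * b[k][j];
                }
              }
            }
          }
        }
      }
    }
  }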
> >
> >
> >
> Thanks again for this...
>
> ...
> > > So in short, the second issue I'm faced with is:
> > >
> > > 2. What are the distributed data structures/operations/iterators that
> > > would be perfect for these algorithm-related data structures/operations
> > > in Mahout?
> > >
> >
> > Write down your algorithm first.  Don't just point to a paper that
> > references another paper.
> >
> Yes, I've made notes; it's just that I was focussing on individual
> operations rather than the bigger picture, and typographically I thought
> it more convenient to quote the paper in the mail.
>
>
> > > I really think that rashly doing something is going to cause me a lot of
> > > days of misspent time,
> >
> >
> > You are correct here, but not in the way that you think.
> >
>
> :) Thanks, I hope so.
>
> Sisir
>