Posted to dev@mahout.apache.org by Sisir Koppaka <si...@gmail.com> on 2010/05/30 13:51:39 UTC

[GSOC] Matrix Operations on HDFS

Hi,
I was looking for distributed map-reduce based 1D, 2D, and 3D operations on
HDFS for the RBM algorithm. o.a.m.math.matrix has them but they are marked
"@deprecated until unit tests are in place.  Until this time, this
class/interface is unsupported."

Jake posted about  o.a.m.math.hadoop.decomposer.DistributedLanczosSolver in
Shannon's thread a few days ago - is there something like that for
distributed map-reduce operations on HDFS for generic matrices? I need these
operations because they don't fit in memory for large datasets.

--
Sisir

Re: [GSOC] Matrix Operations on HDFS

Posted by Sean Owen <sr...@gmail.com>.
Wow that's a lot of thinking. Maybe we can back up a bit and speak in
generalities.

In this world there's no such thing as cramming the data set into
memory. SparseMatrix is out. There isn't any efficient way to access
arbitrary data, at all. You'll think in terms of an algorithm that
proceeds in a series of large, bulk operations (e.g. matrix times
matrix) which are decomposable into small, independent computations.

Simple ideas like "divide everything by the count of all users" can be
surprisingly tricky. You have to count, output, then inject that value
into a new mapreduce.
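For instance, the mapper for that second pass might look something like this
(a rough, untested sketch: it assumes the rows are stored as Mahout
VectorWritables, and that the driver injected the count from the first job
under a made-up "user.count" key):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.VectorWritable;

  // Second pass: divide every row by a global count computed by an earlier job.
  public class DivideByCountMapper
      extends Mapper<LongWritable, VectorWritable, LongWritable, VectorWritable> {

    private double userCount;

    @Override
    protected void setup(Context ctx) {
      // "user.count" is a hypothetical key the driver sets from the first job's output
      userCount = Double.parseDouble(ctx.getConfiguration().get("user.count"));
    }

    @Override
    protected void map(LongWritable rowId, VectorWritable row, Context ctx)
        throws IOException, InterruptedException {
      row.set(row.get().divide(userCount));
      ctx.write(rowId, row);
    }
  }

With zero reducers, that's the whole second job.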

You can see my implementation of a simplistic co-occurrence based
recommender in org.apache.mahout.cf.taste.hadoop. It proceeds mostly
in terms of operations on vectors and it's a little tangled.

DistributedRowMatrix may streamline a lot of this, I'm not personally
familiar with it (but sounds like an excellent sort of idea). I
operate in terms of Vectors, and "matrices" as lots of Vectors
together. There's never an operation whose atomic component is in
terms of "a whole matrix".

That may send you back to the drawing boards a bit, but this is the
structure / data structure to have in mind on Hadoop.

Others, I'm sure, have more nuanced ideas here.

Re: [GSOC] Matrix Operations on HDFS

Posted by Jake Mannix <ja...@gmail.com>.
Hey Sisir,

  Some questions, to hopefully spark some of your thinking:

If you try to step into the mindset of map/reduce steps, can you outline
what
the algorithm is doing during the steps a) updating the input weights,
b) updating the hidden weights, and c) contrastive divergence?

If you put all of the ratings of a particular user into a Vector (which
structurally looks like a Map<Integer,Double>, but is more efficient and
has lots of mathematical methods available on it), with the keys
being the movieId's the user has rated, and the value being their rating,
then your entire data set could live on HDFS as a DistributedRowMatrix.
Similarly, if you take the transpose() of that matrix, you get back
an HDFS file whose rows are Vector instances of the ratings for
a given movie, with the entries keyed on the userId who rated them.
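Concretely, building one of those rows looks something like this (just a
sketch; a SequenceFile of (row id, VectorWritable) pairs like these is
basically what DistributedRowMatrix sits on top of):

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class RatingsRow {
    // one row of the ratings matrix: all of one user's ratings, keyed by movieId
    public static Vector ratingsRow(int numMovies) {
      Vector row = new RandomAccessSparseVector(numMovies);
      row.set(3, 4.0);    // movie 3, rating 4
      row.set(17, 5.0);   // movie 17, rating 5
      return row;
    }
  }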

The algorithm for training in the Hinton paper you linked to seems
to allow for a pretty nice "per-user" parallelization (reading section
2.2, in particular) - each user gets "their own" separate RBM,
with a constraint that after every iteration, you need to average
together all of the connection weights (and biases) between all
of the users, and use this to initialize into the next iteration.

In particular, it looks like what you do is distribute the connection
coefficient matrix (well, matrices, because for the discrete
weightings case that you and Hinton are considering, you actually
have one set of weights per rating, 1,...,5) via DistributedCache
to all of the Hadoop nodes.  This is mostly fine: the size of this
data is 5 * numItems (let's not get too caught up in them being
movies ;) ) * numHiddenNodes * 8 bytes, so mappers would
be able to fit this all in memory for 100 hidden units up to about
100K movies for small-memory-footprint Mappers, and you could
probably scale up to 10x that if you give them all the memory allowable
in the Mapper JVMs.  To scale further, we would need to do
further parallelization.

But ok, so you distribute the "current" set of biases and
connections to all of your mappers, and load them into memory
at the start of everything, and then the Mappers iterate over
one user at a time (getting all of their ratings at once), and
you can iterate over each of their ratings, updating a copy of
a portion of the connections and biases matrix (just the part
which corresponds to the items the user has rated) using
CD, and then you *write the updated values to HDFS*.  This
is where some cleverness should probably come in: you want
to write out key/value pairs, keyed so that the same key gets
multiple values, so you can do the averaging easily in parallel
too.  Keying on the index of the hidden units would probably
work: send any updated weights and biases for hidden unit "j"
out as (j, deltaW_j), where deltaW_j is really a matrix of
values deltaW_ijk, the connections between hidden unit j and
rating k of item i.

The reduce step would then be really just a sum and
average, and after that map/reduce pass you need to then
redistribute the new W matrix out to all the nodes again
for the next iteration.
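In code, that reduce could be as simple as this (an untested sketch,
assuming the mappers emit each deltaW_j slice wrapped in a MatrixWritable,
keyed by the hidden unit index):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.mahout.math.Matrix;
  import org.apache.mahout.math.MatrixWritable;

  // Averages the per-user weight/bias updates for one hidden unit j.
  public class AverageDeltaReducer
      extends Reducer<IntWritable, MatrixWritable, IntWritable, MatrixWritable> {

    @Override
    protected void reduce(IntWritable hiddenUnit, Iterable<MatrixWritable> deltas, Context ctx)
        throws IOException, InterruptedException {
      Matrix sum = null;
      int users = 0;
      for (MatrixWritable delta : deltas) {
        // Hadoop reuses the writable, so accumulate into a matrix we own
        sum = (sum == null) ? delta.get().like() : sum;
        sum = sum.plus(delta.get());
        users++;
      }
      if (sum != null) {
        ctx.write(hiddenUnit, new MatrixWritable(sum.divide(users)));
      }
    }
  }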

I'm not an RBM expert, so I'm not sure I have the way the
training on these things works exactly right, but it certainly
seems like the operation looks a lot like what Ted was saying:
a matrix multiplication of the observed rating matrix times the
weights matrix (plus the bias, and then the logistic function
applied to each of the values of the output).

It's a little trickier than normal matrix multiplication,
because the matrix in question has this extra index of
which rating was given, but that could get folded into
the inner map function.  Essentially you could do a
map-side join of the visible ratings matrix and the
weights matrix, joined on the key they share in
common: movies (items), and do (for each movie i,
hidden unit j, and rating k):

X_ij = Sum_(k=1..5) (v_ik * W_ijk)

then emit key,value pairs as (j, X_ij).

The reducer then sums up all of the different X_ij
for different movies i, adds on the bias b_j (which will
have to have been loaded into memory), applies the
logistic function, and you've got the current guess as to
p(h_j = 1 | V).
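Written out per user in plain, non-distributed Java, the shape of that
computation is roughly the following (just a sketch, with W indexed as
[item][hiddenUnit][rating]; because the visible units are one-hot over the
5 ratings, the sum over k collapses to picking out the rated k):

  public class HiddenUnits {
    // p(h_j = 1 | V) for one user, given the items they rated and ratings 1..5
    public static double[] hiddenProbabilities(int[] ratedItems, int[] ratings,
                                               double[][][] w, double[] hiddenBias) {
      double[] p = new double[hiddenBias.length];
      for (int j = 0; j < p.length; j++) {
        double x = hiddenBias[j];                // b_j
        for (int n = 0; n < ratedItems.length; n++) {
          int i = ratedItems[n];                 // movie index
          int k = ratings[n] - 1;                // rating 1..5 -> index 0..4
          x += w[i][j][k];                       // v_ik = 1 only for the given rating
        }
        p[j] = 1.0 / (1.0 + Math.exp(-x));       // logistic
      }
      return p;
    }
  }

The distributed version just spreads the inner sum across map tasks keyed on
j, with the reducer doing the bias-plus-logistic part.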

This technique will require more map-reduce steps
than the one above, but will scale better (no memory
limitations).

So I'm not sure if any of this was helpful, but I'm just
trying to brainstorm along with you on ways you can
distribute this problem such that you're not stuck
calculating how much fits in memory.

Let me know which parts of this make sense, and which
parts are the ramblings of someone who doesn't
understand that much about neural nets in general, let
alone RBM's (but I'd like to learn more!).

  -jake



On Mon, May 31, 2010 at 2:32 PM, Sisir Koppaka <si...@gmail.com>wrote:

> On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <te...@gmail.com> wrote:
>
> > On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <sisir.koppaka@gmail.com
> > >wrote:
> >
> >
> > > The general access pattern to the dataset within any algo would be to
> > find
> > > the lists of (users,ratings) for each movie and to find the lists of
> > > (movies, ratings) for each user in a fast way. In C++, this design
> > > requirement can be satisfied using an in-memory approach.
> > >
> >
> > This pattern of access is essentially the same as a matrix
> multiplication.
> >  IF A is the user x movie matrix, then your computation has the same
> > computational shape as the multiplication A' A.
> >
> >
> This was insightful. I wanted to get out of my current thinking of the
> problem. Thanks.
>
>
> > > One solution is we first design a bit access pattern such that we fit
> > > (user,
> > > rating) or (movie, rating) in 32 bits(Actually even the date can fit,
> but
> > > that depends if we have a temporal model to use it :) ).
> >
> >
> > I don't think that this is a good design decision at all.  Much better to
> > keep your structures general.  Many large applications will have more
> than
> > 20,000 items.  To use Veoh as an example again, we had tens of millions
> of
> > users and about 10 million videos.
>
>
> I will definitely change that. I was trying to explain why I used userent[]
> in the code...which was because I was still thinking along those lines.
> Course correction for sure. I was trying to find possible Mahout data
> structures here that would be useful in replacing the current non-generic
> format.
>
>
>
> > So the first one would be like:
> > >
> > > User 1:
> > > Movie 3, Rating 4
> > > Movie 17, Rating 5
> >
> >
> > This is just a sparse matrix.  You should use a pre-designed data
> structure
> > for this.
> >
>
> Ok...
>
>
> >
>
> > > and vice versa. With dates if necessary.
> > >
> >
> > Adding dates is easily done with a second matrix.  Make sure you notice
> > that
> > there are two kinds of sparse matrix.  You want the sequential access
> kind.
> >  You might look to see if you can reduce your storage requirements by
> > storing the equivalent of A'A instead of A' and A as you are suggesting.
> >
> >
> This was also very useful, thanks.
>
>
> >
> > > A few examples of the operations -
> > >
> > > for (i=0; i<softmax; i++) {
> > >  visbiases[j][i] =
> > > Math.log(((double)moviercount[j*softmax+i])/((double)mtot));
> > > }
> > >
> > > Here's another example -
> > > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn)
> -
> > > weightCost * vishid[m][rr][h]);
> > >
> >     c = new DenseMatrix(a.rows(), b.columns())
> >     for (int i = 0; i < a.rows(); i++) {
> >        for (int j = 0; j < b.columns(); j++) {
> >           double sum = 0;
> >           for (int k = 0; k < b.columns(); k++) {
> >               sum += a.get(i, k) * b.get(k, j);
> >           }
> >           c.put(i, j, sum);
> >       }
> >    }
> >
> > Since memory is much slower than the arithmetic in modern processes, this
> > creates a memory bottleneck.  What you need to do, then is to re-use
> > elements many times to avoid reading them from main memory.  To do this,
> you
> > need to do a block decomposition of the matrix multiply and you may need
> to
> > do two levels of block decomposition in order to reflect L1 cache and
> number
> > of available registers
> >
> >
> >
> Thanks again for this...
>
> ...
> > > So in short, the second issue I'm faced with is:
> > >
> > > 2. What are the distributed data structures/operations/iterators that
> > would
> > > be perfect for these algorithm-related data structures/operations in
> > > Mahout?
> > >
> >
> > Write down your algorithm first.  Don't just point to a paper that
> > references another paper.
> >
> > Yes I've made notes, it's that I was focussing on individual operations
> rather than the bigger picture and typographically, I thought it more
> convenient to quote the paper in the mail.
>
>
> > > I really think that rashly doing something is going to cause me a lot
> of
> > > days of misspent time,
> >
> >
> > You are correct here, but not in the way that you think.
> >
>
> :) Thanks, I hope so.
>
> Sisir
>

Re: [GSOC] Matrix Operations on HDFS

Posted by Sisir Koppaka <si...@gmail.com>.
On Tue, Jun 1, 2010 at 2:21 AM, Ted Dunning <te...@gmail.com> wrote:

> On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <sisir.koppaka@gmail.com
> >wrote:
>
>
> > The general access pattern to the dataset within any algo would be to
> find
> > the lists of (users,ratings) for each movie and to find the lists of
> > (movies, ratings) for each user in a fast way. In C++, this design
> > requirement can be satisfied using an in-memory approach.
> >
>
> This pattern of access is essentially the same as a matrix multiplication.
>  IF A is the user x movie matrix, then your computation has the same
> computational shape as the multiplication A' A.
>
>
This was insightful. I wanted to get out of my current thinking of the
problem. Thanks.


> > One solution is we first design a bit access pattern such that we fit
> > (user,
> > rating) or (movie, rating) in 32 bits(Actually even the date can fit, but
> > that depends if we have a temporal model to use it :) ).
>
>
> I don't think that this is a good design decision at all.  Much better to
> keep your structures general.  Many large applications will have more than
> 20,000 items.  To use Veoh as an example again, we had tens of millions of
> users and about 10 million videos.


I will definitely change that. I was trying to explain why I used userent[]
in the code...which was because I was still thinking along those lines.
Course correction for sure. I was trying to find possible Mahout data
structures here that would be useful in replacing the current non-generic
format.



> So the first one would be like:
> >
> > User 1:
> > Movie 3, Rating 4
> > Movie 17, Rating 5
>
>
> This is just a sparse matrix.  You should use a pre-designed data structure
> for this.
>

Ok...


>

> > and vice versa. With dates if necessary.
> >
>
> Adding dates is easily done with a second matrix.  Make sure you notice
> that
> there are two kinds of sparse matrix.  You want the sequential access kind.
>  You might look to see if you can reduce your storage requirements by
> storing the equivalent of A'A instead of A' and A as you are suggesting.
>
>
This was also very useful, thanks.


>
> > A few examples of the operations -
> >
> > for (i=0; i<softmax; i++) {
> >  visbiases[j][i] =
> > Math.log(((double)moviercount[j*softmax+i])/((double)mtot));
> > }
> >
> > Here's another example -
> > CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) -
> > weightCost * vishid[m][rr][h]);
> >
>     c = new DenseMatrix(a.rows(), b.columns())
>     for (int i = 0; i < a.rows(); i++) {
>        for (int j = 0; j < b.columns(); j++) {
>           double sum = 0;
>           for (int k = 0; k < b.columns(); k++) {
>               sum += a.get(i, k) * b.get(k, j);
>           }
>           c.put(i, j, sum);
>       }
>    }
>
> Since memory is much slower than the arithmetic in modern processes, this
> creates a memory bottleneck.  What you need to do, then is to re-use
> elements many times to avoid reading them from main memory.  To do this, you
> need to do a block decomposition of the matrix multiply and you may need to
> do two levels of block decomposition in order to reflect L1 cache and number
> of available registers
>
>
>
Thanks again for this...

...
> > So in short, the second issue I'm faced with is:
> >
> > 2. What are the distributed data structures/operations/iterators that
> would
> > be perfect for these algorithm-related data structures/operations in
> > Mahout?
> >
>
> Write down your algorithm first.  Don't just point to a paper that
> references another paper.
>
Yes, I've made notes; it's just that I was focusing on individual operations
rather than the bigger picture, and typographically I thought it more
convenient to quote the paper in the mail.


> > I really think that rashly doing something is going to cause me a lot of
> > days of misspent time,
>
>
> You are correct here, but not in the way that you think.
>

:) Thanks, I hope so.

Sisir

Re: [GSOC] Matrix Operations on HDFS

Posted by Ted Dunning <te...@gmail.com>.
On Mon, May 31, 2010 at 12:40 PM, Sisir Koppaka <si...@gmail.com>wrote:

> Sure, I'll discuss a bit of what I wanted to do and what operations are
> required.
>
> The issue I had when I used naive Java to load up the Netflix dataset was
> that it simply didn't fit in memory. Even on a machine with 24GB RAM,
> probably due to the appendage that any data type comes with in Java.
> There's
> no question that the source data has to live in some distributed fashion,
> as
> with any other algorithm, in Mahout.
>

This memory usage is not inherent in Java itself, merely in the use of the
highly generic collections that come for free in Java (although the size you
quote is a bit high).

If you switch to using the sparse matrix structures from Mahout, you should
be able to do much better.

Keep in mind, though, that the 100 million ratings in the Netflix data set
is only a moderate sized dataset.  At Veoh, for instance, our recommendation
engine had to handle 50 billion user x item observations, and we never did
more than barely break into the top 100 web sites.  There are lots of
places that handle far more data.



> The general access pattern to the dataset within any algo would be to find
> the lists of (users,ratings) for each movie and to find the lists of
> (movies, ratings) for each user in a fast way. In C++, this design
> requirement can be satisfied using an in-memory approach.
>

This pattern of access is essentially the same as a matrix multiplication.
If A is the user x movie matrix, then your computation has the same
computational shape as the multiplication A' A.
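In terms of the in-memory Mahout types, just to show the shape (not
something you would run on the full data set):

  import org.apache.mahout.math.Matrix;

  public class AccessShape {
    // a is users x movies, with a.get(user, movie) = rating (0 where unrated);
    // a' a is movies x movies, and computing it touches exactly the per-movie /
    // per-user (user, movie, rating) triples described above
    public static Matrix primeTimes(Matrix a) {
      return a.transpose().times(a);
    }
  }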


> One solution is we first design a bit access pattern such that we fit
> (user,
> rating) or (movie, rating) in 32 bits(Actually even the date can fit, but
> that depends if we have a temporal model to use it :) ).


I don't think that this is a good design decision at all.  Much better to
keep your structures general.  Many large applications will have more than
20,000 items.  To use Veoh as an example again, we had tens of millions of
users and about 10 million videos.


> So the first one would be like:
>
> User 1:
> Movie 3, Rating 4
> Movie 17, Rating 5


This is just a sparse matrix.  You should use a pre-designed data structure
for this.


> and vice versa. With dates if necessary.
>

Adding dates is easily done with a second matrix.  Make sure you notice that
there are two kinds of sparse matrix.  You want the sequential access kind.
 You might look to see if you can reduce your storage requirements by
storing the equivalent of A'A instead of A' and A as you are suggesting.
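In Mahout the two kinds show up at the vector level too; the usual pattern
is something like this (a sketch): build rows with the random access kind,
convert once the writes are done, and then scans are cheap:

  import java.util.Iterator;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.SequentialAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class SparseKinds {
    // build with the random access (hash-backed) kind, then convert to the
    // sequential (parallel-array) kind once writing is done, so scans are cheap
    public static double sumRatings(RandomAccessSparseVector ratingsBeingBuilt) {
      Vector scannable = new SequentialAccessSparseVector(ratingsBeingBuilt);
      double sum = 0;
      for (Iterator<Vector.Element> it = scannable.iterateNonZero(); it.hasNext(); ) {
        sum += it.next().get();
      }
      return sum;
    }
  }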


> In the current context, I had the following problems with the above plan:
> 1. *Read the netflix-dataset.csv file of (user,movie,rating,date) and write
> it as ? in HDFS.* Should ? be a sparse matrix or a distributed row matrix
> or
> something else?


I think that this is just a bit premature.  Do you have a document that
describes what you want to do at an algorithmic level?  You have given some
hints about this, but picking a data structure should be done after
describing the algorithm.  Especially when you are talking about
parallelization, you need to think algorithmically and not get too low level
too soon.

> C++ format? The guarantee that the format here should provide at the minimum
> is a fast access to each user's or movie's entries throughout the dataset.
> I'm not clear on what to choose for this purpose from those within Mahout.
>

You keep talking in terms of essentially random access to data structures.
 To effectively parallelize this algorithm you will need to move beyond that
and describe how you can smoothly scan the data in parallel.  This requires
that you not think about the algorithm in terms of indexing this array or
twiddling that bit, but move up a level to think about what you really
intend.


> A few examples of the operations -
>
> for (i=0; i<softmax; i++) {
>  visbiases[j][i] =
> Math.log(((double)moviercount[j*softmax+i])/((double)mtot));
> }
>
> ...
> Summation of a particular length of an array is something that is done
> extensively again and again in many places.
>
> Here's another example -
> CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) -
> weightCost * vishid[m][rr][h]);
>

These loops need to be viewed in a large context to make an intelligent
recommendation.  To take a very simple example, many people would write a
dense matrix multiplication this way (this is pseudo code, don't try to
compile it):

     c = new DenseMatrix(a.rows(), b.columns());
     for (int i = 0; i < a.rows(); i++) {
       for (int j = 0; j < b.columns(); j++) {
         double sum = 0;
         for (int k = 0; k < a.columns(); k++) {
           sum += a.get(i, k) * b.get(k, j);
         }
         c.put(i, j, sum);
       }
     }

This looks fine, but is typically nearly an order of magnitude slower than
the optimal code.  If you try to "optimize" this code as it stands, you are
very unlikely to succeed.  If, on the other hand, you back up a step and
think about what matrix multiplication really is, and what it costs to do,
you will quickly realize that the real problem is that there are O(n^3)
arithmetic operations against O(n^2) data elements, but that this set of
loops is doing O(n^3) memory fetches.  Since memory is much slower than the
arithmetic in modern processors, this creates a memory bottleneck.  What you
need to do, then, is to re-use elements many times to avoid reading them from
main memory.  To do this, you need to do a block decomposition of the matrix
multiply, and you may need to do two levels of block decomposition in order
to reflect the L1 cache and the number of available registers.

The lesson here is not just to re-use elements.  That is a valid
micro-lesson, but the big lesson is to back up to the mathematics before
trying to write the code, especially when moving to a new architecture.  A
medium-sized lesson is to use abstract operations for as big a piece of what
you are doing as possible, so that you make use of somebody else's
optimization efforts.
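To make the micro-lesson concrete, here is a one-level blocked version of the
same pseudo code (still pseudo code, still don't try to compile it; B is
something you would tune so that a few BxB blocks fit in L1, and c is assumed
to start out zeroed):

     int B = 64;   // block size; tune so a few BxB blocks fit in L1
     c = new DenseMatrix(a.rows(), b.columns());   // assumed all zeros to start
     for (int i0 = 0; i0 < a.rows(); i0 += B) {
       for (int k0 = 0; k0 < a.columns(); k0 += B) {
         for (int j0 = 0; j0 < b.columns(); j0 += B) {
           // multiply one block of a by one block of b, accumulating into c;
           // every element loaded here gets re-used about B times before it
           // falls out of cache, instead of exactly once
           for (int i = i0; i < Math.min(i0 + B, a.rows()); i++) {
             for (int k = k0; k < Math.min(k0 + B, a.columns()); k++) {
               double aik = a.get(i, k);
               for (int j = j0; j < Math.min(j0 + B, b.columns()); j++) {
                 c.put(i, j, c.get(i, j) + aik * b.get(k, j));
               }
             }
           }
         }
       }
     }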

...
> So in short, the second issue I'm faced with is:
>
> 2. What are the distributed data structures/operations/iterators that would
> be perfect for these algorithm-related data structures/operations in
> Mahout?
>

Write down your algorithm first.  Don't just point to a paper that
references another paper.


> So, in short, the first issue is with obtaining a fast iterator across
> user-indexed entries and movie-indexed entries(whether it be with a sparse
> matrix, or distributed row matrix is the doubt), and the second issue is
> how
> do I go about parallelizing the above operations within the algorithm(once
> I've got the user iterator and movie iterator, and format cleared up),
> which
> are terribly inefficient in a serial format(and one of the primary issues
> with any deep belief net).
>

Actually, these are probably something like the 4th and 5th issues that you
face.  First write your algorithm down using high level notation (NOT
element level pseudo-code).  Then think about how you can decompose that in
a map-reduce framework.  Then think about what matrix primitives can help
with the decomposed form.  Only then should you start to talk about data
structures.


> I really think that rashly doing something is going to cause me a lot of
> days of misspent time,


You are correct here, but not in the way that you think.

Re: [GSOC] Matrix Operations on HDFS

Posted by Sisir Koppaka <si...@gmail.com>.
Sure, I'll discuss a bit of what I wanted to do and what operations are
required.

The issue I had when I used naive Java to load up the Netflix dataset was
that it simply didn't fit in memory, even on a machine with 24GB RAM,
probably due to the overhead that any data type carries in Java. There's
no question that the source data has to live in some distributed fashion, as
with any other algorithm in Mahout.

The general access pattern to the dataset within any algo would be to find
the lists of (users,ratings) for each movie and to find the lists of
(movies, ratings) for each user in a fast way. In C++, this design
requirement can be satisfied using an in-memory approach.

One solution is to first design a bit-packing scheme such that we fit (user,
rating) or (movie, rating) in 32 bits (actually even the date can fit, but
that depends on whether we have a temporal model to use it :) ). This does
put some limitations on the number of users and movies, but at least for
Netflix - 500,000-odd users and 17,770 movies - they can be made to fit
within 32 bits. Then we generate two different arrays of 32-bit units, one
indexed by user and the other by movie.

So the first one would be like:

User 1:
Movie 3, Rating 4
Movie 17, Rating 5
...
...
User 2:
Movie 5, Rating 1
...

and vice versa. With dates if necessary.

Each of these arrays is around 300MB, so both of them can fit into memory
within 600 MB. These are precomputed, and the dataset is always accessed in
this format. Using mmap(), they're loaded into memory, and all the
algorithms (KNN, SVD, RBM) take advantage of the fast iterating ability and
run quite fast.

There's a short directory index at the beginning, which records where each
user's entries begin in the user-indexed array and where each movie's entries
begin in the movie-indexed array - for each of the 480,000-odd users and each
of the 17,770 movies. So the cost of finding the location of a user's or
movie's entries is limited to two array lookups.

These are the errors that show up in the current .diff implementation of a
pure RBM - I worked from the above idea (a preconceived notion from a
previous C++-based approach) while writing it, until I can settle on the
right way of doing this in Mahout (right data structure, right access
iterator, and so on) after discussing it on the list.

userent[base0+j] & USER_MOVIEMASK is marked TODO: Replace in several places.
base0 refers to the index where a particular user's entries begin; it is
updated as needed before any user- or movie-specific iteration when loading
the RBM data structures. The mask extracts the rating, and there are a
couple of masks in use.

In the current context, I had the following problems with the above plan:
1. *Read the netflix-dataset.csv file of (user,movie,rating,date) and write
it as ? in HDFS.* Should ? be a sparse matrix or a distributed row matrix or
something else? Either way, would it be necessary to convert it to the above
C++ format? The guarantee that the format here should provide at the minimum
is a fast access to each user's or movie's entries throughout the dataset.
I'm not clear on what to choose for this purpose from those within Mahout.

The second part of the problem is, there are a ton of data structures that
we keep building during the algorithm training. This is the second issue I
have with regard to choosing the right Mahout-specific access methods/data
structures. But a short excursion now to the algorithm itself.

This <http://www.machinelearning.org/proceedings/icml2007/papers/407.pdf> is
the algorithm that has been tested on the Netflix dataset. There is a MATLAB
version of it provided by the authors for MNIST digits dataset, and there
are independent C++ versions for Netflix.

A key design idea is that we'll have just two layers - a visible softmax
layer and a binary/gaussian/etc. hidden layer, depending on the variant. The
mapping is also restricted in such a fashion as to allow distributed
computation. Another key idea is that instead of exact maximum-likelihood
learning (which takes exponential time), we use Contrastive Divergence,
which was the author's contribution in a previous paper. This has been
reported to give a good balance of approximation quality and performance.
The algorithm predicts a rating for a (user, movie) pair in time linear in
the number of hidden units.

There are also variants, including *conditional* RBMs and *factored* RBMs.
The conditional variety uses the information that a given user has already
watched a movie in the test dataset - which does say something; not watching
a movie is also useful information, and it uses that too. This is modelled
by using different formulae for each stage of the RBM. Factoring is a way to
reduce the number of free parameters used in the predictions - it may or may
not be a good choice when we have a cluster to use, so the aim is to provide
all these varieties in a neatly wrapped package in Mahout.

Now, the RBM implemented in MAHOUT-375.diff is a pure RBM - so no
conditional or factored versions yet. But I planned on writing this one
first so that I could then tackle the issue of the data input and data
structures and then rapidly iterate by refactoring, and add in the other
variants.

A few examples of the operations -

for (int i = 0; i < softmax; i++) {
  visbiases[j][i] =
      Math.log(((double) moviercount[j * softmax + i]) / ((double) mtot));
}

So here, the biases of the visible units are being updated using a movie's
specific data. There's another for loop for summing up mtot (mtot +=
moviercount[j*softmax+k]) with k from 0 - 5. This happens a lot of times.

Summation of a particular length of an array is something that is done
extensively again and again in many places.
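(If it helps the discussion: I assume the in-memory Mahout way to write that
kind of slice sum, instead of the hand-rolled loop, is roughly the following,
treating moviercount as a Vector rather than a bare array.)

  import org.apache.mahout.math.Vector;

  public class SliceSum {
    // the sum the mtot loop above computes: `softmax` consecutive counts
    // starting at j * softmax
    public static double sliceSum(Vector movieRatingCounts, int j, int softmax) {
      return movieRatingCounts.viewPart(j * softmax, softmax).zSum();
    }
  }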

Here's another example -
CDinc[m][rr][h] = Momentum * CDinc[m][rr][h] + EpsilonW * ((CDp - CDn) -
weightCost * vishid[m][rr][h]);

So this is part of the contrastive divergence learning approximation that I
mentioned earlier. rr varies from 0 to softmax, h from 0 to totalFeatures,
and m from 0 to numItems (480,000 in the case of the Netflix dataset - this
factor alone is large). And this happens in every iteration. This part would
benefit immensely from distributed data structures and associated operations.

Another naive idea, of course, is that computing predictions for the 2.8
million entries in the Netflix Prize qualifying set - or for any equal or
larger generic recommender system - would benefit immensely from being done
in parallel. I guess this will become clear once the previous issue is
figured out.

So in short, the second issue I'm faced with is:

2. What are the distributed data structures/operations/iterators that would
be perfect for these algorithm-related data structures/operations in
Mahout?

So, in short, the first issue is obtaining a fast iterator across
user-indexed entries and movie-indexed entries (whether with a sparse matrix
or a distributed row matrix is the doubt), and the second issue is how I go
about parallelizing the above operations within the algorithm (once I've got
the user iterator, the movie iterator, and the format cleared up), which are
terribly inefficient in a serial form (and one of the primary issues with
any deep belief net).

I really think that rashly doing something is going to cost me a lot of days
of misspent time, so that's why I am trying to figure this out very clearly
on the list, with your advice, before going ahead. Please do mention
anything that you find relevant - it would be immensely useful to me.

Thanks, really, for the responses so far - it's very encouraging.

Sisir

Re: [GSOC] Matrix Operations on HDFS

Posted by Jake Mannix <ja...@gmail.com>.
On Sun, May 30, 2010 at 6:30 PM, Ted Dunning <te...@gmail.com> wrote:

> The Distributed Row Matrix should be ideal for this.  When you run mappers
> against this data structure, each mapper gets a different row.  You can use
> assign to compute your function on each element of a row in the mapper.
>  Define number of reducers = 0 and you are set.
>

Heh, Ted, you are aware that the instance methods I described were
hypothetical,
right?  I can certainly add them pretty easily (as I'm sure Sisir could too,
in a
patch), but they're "throwing NotYetImplemented" currently, as one might
say.


> Are you sure that you don't need some kind of reduction function, however?
>

This is the part I'd really like to write up (also not yet implemented),
where you give the DistributedRowMatrix a "mapper" of some kind which, on
each row, takes that Vector and produces one or more key-value pairs of
linear-algebra types: keys could be either Null or Integer (possibly
Pair<Integer,Integer>?), and values could be Integer, Double, Vector, or
Matrix.  You also pass into that same method call a "reducer" which does the
obvious thing, eventually spitting out key/value pairs of the same types
(and if it ends up being (Null, Vector) or (Integer, Double) there could be
a nice way to make this function have a Vector return type, and if instead
the reduce spits out (Pair<Int,Int>, Double), (Int, Vector), or
(Null, Matrix), it could return a DistributedRowMatrix).

I've been wanting to add something like this, a kind of "numeric-specific"
but otherwise generic MapReduce api, for a while, but I've been holding
off on account of not wanting to overengineer, if nobody would be using
it.
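To make that a bit more concrete, the shape I have in mind is roughly this -
none of it exists in the codebase, it's purely hypothetical:

  import org.apache.mahout.math.Vector;

  // hypothetical numeric-specific map/reduce API over the rows of a
  // DistributedRowMatrix -- none of this exists yet
  public final class RowMapReduce {
    public interface Collector<K, V> {
      void collect(K key, V value);
    }
    public interface RowMapper<K, V> {
      void map(int rowIndex, Vector row, Collector<K, V> out);
    }
    public interface RowReducer<K, V, R> {
      R reduce(K key, Iterable<V> values);
    }
    private RowMapReduce() { }
  }

  // so something like
  //   Vector rowSums = matrix.mapReduce(rowSumMapper, doubleSumReducer);
  // could come back as a plain Vector when the key/value types work out
  // to (Integer, Double)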

This is why, Sisir, I'd like to know exactly what kinds of operations you'd
want to do on a big sparse HDFS-backed matrix - simple mutation of
the rows, based on some inputs, or do you need to do some aggregation
across rows and make new kinds of reduced output, or what?  Could
you maybe give a little write-up of how the RBM you're coding up works,
for those of us not "in the thick of it", like Shannon did last week?

  -jake



> You might also look at the k-means clustering which probably is related to
> what you are doing in some sense.
>
> On Sun, May 30, 2010 at 3:24 PM, Sisir Koppaka <sisir.koppaka@gmail.com
> >wrote:
>
> > I think I need the sort of operation Jake described above  -
> > wherein I can call a function f on a vector of the whole matrix(the
> dataset
> > here, which is sparse) in a distributed fashion) I'll see this in detail
> > tomorrow. But any other pointers on this issue with reference to the
> > MAHOUT-375.diff update are very welcome.
> >
>

Re: [GSOC] Matrix Operations on HDFS

Posted by Ted Dunning <te...@gmail.com>.
The Distributed Row Matrix should be ideal for this.  When you run mappers
against this data structure, each mapper gets a different row.  You can use
assign to compute your function on each element of a row in the mapper.
 Define number of reducers = 0 and you are set.

Are you sure that you don't need some kind of reduction function, however?

You might also look at the k-means clustering which probably is related to
what you are doing in some sense.

On Sun, May 30, 2010 at 3:24 PM, Sisir Koppaka <si...@gmail.com>wrote:

> I think I need the sort of operation Jake described above  -
> wherein I can call a function f on a vector of the whole matrix(the dataset
> here, which is sparse) in a distributed fashion) I'll see this in detail
> tomorrow. But any other pointers on this issue with reference to the
> MAHOUT-375.diff update are very welcome.
>

Re: [GSOC] Matrix Operations on HDFS

Posted by Sisir Koppaka <si...@gmail.com>.
Hi,
I've updated JIRA <https://issues.apache.org/jira/browse/MAHOUT-375> with a
short update on my progress (as a .diff, of course, no theatricals). I've
implemented a decent portion of a pure RBM algorithm, albeit not in a
distributed fashion. The specific issues that are my bottlenecks now, partly
because I have focussed more on the algorithm's variants than on the code
base so far, are:

1. Getting the dataset in. One way, I assume, is to read in the dataset from
a File using the FileDataModel on
this <https://mahout-rbm-netflix.s3.amazonaws.com/netflix-dataset-wodates.csv>
(clicking in Chrome downloads a multi-GB file, but Safari lets you see the
first few lines of the csv without downloading), which is a csv-formatted
version of the Netflix dataset. But is it generic enough to expect a csv
input for the purpose of a recommender beyond the Netflix dataset?

I looked at GenericDataModel, but the comments state that it is useful only
for small experiments and not recommended for contexts where performance is
important. If anybody could give some pointers on how to use DataModels in
Taste, and why they are not used in the other (non-o.a.m.cf) parts, it'd be
really helpful.

2. An ancillary question to the above: can the Python scripts that did the
above formatting for the Netflix dataset be added to a patch in the examples
section, along with the wikipedia and reuters examples? Or is Python not
allowed? Is there a policy like "anything that can be done with bash is OK
and the rest is subject to discussion"?

3. Are there any useful Vector data types, Iterators, or Matrix data types
(sparse, for the case of the Netflix dataset) which you think could be
helpful here (in o.a.m.cf.taste)? I found some, but I would definitely love
any pointers towards specific classes, just to make sure I don't end up
rewriting things out of ignorance. :)

4. Now, the issue of map-reducing the pure RBM and putting the algorithm
data structures on HDFS. I'm positive that the instance data structures of
the pure RBM will not fit into memory for the Netflix dataset (at least,
I've tried Weka, which can't load it, and even naive on-the-fly calculations
of the space required in the worst case say so). I think it's safer to put
them on HDFS. Could you give me some pointers on how to get started on this?
(Thanks Jake and Ted - I think I need the sort of operation Jake described
above, wherein I can call a function f on a vector of the whole matrix (the
dataset here, which is sparse) in a distributed fashion.) I'll look at this
in detail tomorrow. But any other pointers on this issue with reference to
the MAHOUT-375.diff update are very welcome.

I'm sure these are naive questions, so apologies. :) I hope to get past that
hoop soon, once this kicking-the-tyres phase is over.

Thanks once again,
It's good night here, but good day to everyone else!

--
Sisir

Re: [GSOC] Matrix Operations on HDFS

Posted by Jake Mannix <ja...@gmail.com>.
Hi Sisir,

  What operations do you want to do on a distributed matrix?  We don't
really have any 3D operations which are scalable at this time (not sure I
know of any algorithms that require them), but the matrix (2D) and vector (1D)
operations we do currently support in a distributed way are all in the
DistributedRowMatrix class.

  Some of the things we don't currently have in that class are ways to
mutate the matrix itself, but we certainly should have methods like:

  public DistributedRowMatrix assign(UnaryFunction f) {
    for (Vector row : this) { row.assign(f); }
    return this;
  }
  public DistributedRowMatrix assign(Vector v, BinaryFunction f) {
    for (Vector row : this) { row.assign(v, f); }
    return this;
  }

Which would at least allow you to update rows in a distributed
fashion.
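Usage, once those existed, might look like the following (purely
hypothetical, since the assign method above doesn't exist yet): apply the
logistic function to every entry, one row per mapper, zero reducers:

  import org.apache.mahout.math.function.UnaryFunction;
  import org.apache.mahout.math.hadoop.DistributedRowMatrix;

  public class LogisticAssign {
    // hypothetical: squash every entry of the matrix through the logistic function
    public static DistributedRowMatrix squash(DistributedRowMatrix matrix) {
      return matrix.assign(new UnaryFunction() {
        public double apply(double x) {
          return 1.0 / (1.0 + Math.exp(-x));
        }
      });
    }
  }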

What kind of operations would you want to do on a huge distributed
matrix?

  -jake

On Sun, May 30, 2010 at 4:51 AM, Sisir Koppaka <si...@gmail.com>wrote:

> Hi,
> I was looking for distributed map-reduce based 1D, 2D, and 3D operations on
> HDFS for the RBM algorithm. o.a.m.math.matrix has them but they are marked
> "@deprecated until unit tests are in place.  Until this time, this
> class/interface is unsupported."
>
> Jake posted about  o.a.m.math.hadoop.decomposer.DistributedLanczosSolver in
> Shannon's thread a few days ago - is there something like that for
> distributed map-reduce operations on HDFS for generic matrices? I need
> these
> operations because they don't fit in memory for large datasets.
>
> --
> Sisir
>

Re: [GSOC] Matrix Operations on HDFS

Posted by Ted Dunning <te...@gmail.com>.
The idea with those deprecations is that we imported a whole bunch of code
from the Colt package but it didn't have enough (any) unit tests.  To avoid
institutionalizing bugs, we marked everything in that import as deprecated
and are removing the deprecations as we need to use those methods and as we
test them.

If you need those deprecated methods, then go right ahead and write some
unit tests and undeprecate them.  While you are at it, make sure that you
address the checkstyle warnings.  If you have questions about how to do
this, feel free to ask about specific files.

None of these deprecated methods are likely to be distributed although they
may be of considerable use in a distributed implementation.

Jake has implemented a kind of distributed matrix that is useful for
certain kinds of multiplications.  Can you say more about what you need?  It
is possible that a small variant of what he has would work for you.

On Sun, May 30, 2010 at 4:51 AM, Sisir Koppaka <si...@gmail.com>wrote:

> Hi,
> I was looking for distributed map-reduce based 1D, 2D, and 3D operations on
> HDFS for the RBM algorithm. o.a.m.math.matrix has them but they are marked
> "@deprecated until unit tests are in place.  Until this time, this
> class/interface is unsupported."
>
> Jake posted about  o.a.m.math.hadoop.decomposer.DistributedLanczosSolver in
> Shannon's thread a few days ago - is there something like that for
> distributed map-reduce operations on HDFS for generic matrices? I need
> these
> operations because they don't fit in memory for large datasets.
>
> --
> Sisir
>