Posted to dev@mahout.apache.org by Alexander Hans <al...@ahans.de> on 2010/10/20 12:29:50 UTC

Sharing a vector between mappers

Hi,

I've finally got some work done on the LWLR implementation. It's already
functional when used with fixed weights of 1, i.e., plain linear
regression. In that case each mapper gets a vector from the training data
and calculates its contribution to the A matrix (X'*W*X, with W being a
diagonal matrix containing the weights for each training vector, currently
W = I) and to the b vector (X'*W*y, again currently with W = I) for that
training vector. The reducer then sums the individual As and bs to get the
final A and b, which are then used to calculate the coefficient vector
theta. (I think it would be a good idea to have combiners calculate
partial sums and let the reducer form the final sum from the combiners'
output.) The job then loads another file containing input vectors for the
prediction phase, constructs a matrix X from those vectors, and calculates
the output as y = X * theta.
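The arithmetic described above can be sketched without any Hadoop or
Mahout machinery at all; the following uses plain double[] arrays in place
of Mahout's Vector/Matrix types, and all class and method names are
invented for the illustration (they are not from the actual patch):

```java
import java.util.Arrays;

// Sketch of the per-mapper / reducer arithmetic: each training pair
// (x_i, y_i) with weight w_i contributes w_i * x_i x_i' to A and
// w_i * y_i * x_i to b; the "reducer" sums these and solves A theta = b.
public class WlsSketch {

    // One mapper's contribution: A += w * x x', b += w * y * x.
    static void accumulate(double[] x, double y, double w,
                           double[][] A, double[] b) {
        for (int r = 0; r < x.length; r++) {
            for (int c = 0; c < x.length; c++) {
                A[r][c] += w * x[r] * x[c];
            }
            b[r] += w * y * x[r];
        }
    }

    // "Reducer" side: solve A theta = b by Gaussian elimination.
    // No pivoting or singularity checks; fine for illustration only.
    static double[] solve(double[][] A, double[] b) {
        int n = b.length;
        for (int p = 0; p < n; p++) {
            for (int r = p + 1; r < n; r++) {
                double f = A[r][p] / A[p][p];
                for (int c = p; c < n; c++) A[r][c] -= f * A[p][c];
                b[r] -= f * b[p];
            }
        }
        double[] theta = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= A[r][c] * theta[c];
            theta[r] = s / A[r][r];
        }
        return theta;
    }

    public static void main(String[] args) {
        // y = 2*x0 + 3*x1 exactly; with W = I this is plain linear regression.
        double[][] xs = {{1, 0}, {0, 1}, {1, 1}};
        double[] ys = {2, 3, 5};
        double[][] A = new double[2][2];
        double[] b = new double[2];
        for (int i = 0; i < xs.length; i++) {
            accumulate(xs[i], ys[i], 1.0, A, b);  // mapper contributions
        }
        System.out.println(Arrays.toString(solve(A, b))); // prints [2.0, 3.0]
    }
}
```

Because accumulate() only ever adds into A and b, the partial sums are
associative, which is exactly why the combiner idea above works.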

Now for LWLR it doesn't work like that, since each prediction input needs
its own theta vector. So as a first step it would make sense to give the
algorithm a set of training vectors (containing input vectors and target
scalars) and just one prediction input vector. Each mapper would then do
the same as it does now, except that it would also calculate the weight
for its training vector from the training input vector and the prediction
input vector. Now I come to my question: how can I share the prediction
input vector between those individual mappers? I don't want each mapper
to load it from a file. I think a good solution would be to pass it via
the configuration. On a Hadoop-related forum or list someone suggested
serializing the object you want to share to a String and then putting
that String into the configuration. Do you think that's a good idea? If
so, what is the proper Mahout way of serializing a Vector to a String and
deserializing it back to a Vector later?
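For concreteness, here is one hedged sketch of that serialize-to-String
idea, using a plain double[] as a stand-in for a Mahout Vector and
java.util.Base64 as the codec; the real Mahout route would presumably run
VectorWritable's write()/readFields() over the same kind of streams, so
everything below is illustrative rather than the actual API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Base64;

// Round-trip a vector through a String so it can be stashed in the job
// Configuration. Base64 keeps arbitrary bytes safe inside the XML-backed
// Configuration values.
public class VectorConfSketch {

    // Driver side: conf.set("lwlr.query", encode(query));
    static String encode(double[] v) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(v.length);
        for (double d : v) out.writeDouble(d);
        out.flush();
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Mapper side (setup): query = decode(conf.get("lwlr.query"));
    static double[] decode(String s) throws IOException {
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(Base64.getDecoder().decode(s)));
        double[] v = new double[in.readInt()];
        for (int i = 0; i < v.length; i++) v[i] = in.readDouble();
        return v;
    }

    public static void main(String[] args) throws IOException {
        double[] query = {0.5, -1.25, 3.0};
        double[] back = decode(encode(query));
        System.out.println(java.util.Arrays.toString(back));
        // prints [0.5, -1.25, 3.0]
    }
}
```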


Thanks,

Alex


Re: Sharing a vector between mappers

Posted by Ted Dunning <te...@gmail.com>.
Passing small amounts of data via the configuration is reasonable to do,
but it isn't clear that this is a good idea for you.  Do you really only
want to pass around a single input vector for an entire map-reduce
invocation?  Map-reduce takes a looong time to get started.

If you might possibly want to pass many input vectors to each mapper,
then the distributed cache mechanism is probably a better bet for what
you want to do.  Basically, what you are trying to do is isomorphic to a
map-side join, and that is the normal mechanism used for that.
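A minimal sketch of that pattern, outside Hadoop entirely: the side file
below stands in for a file shipped via the distributed cache and read
from the task's local directory, setup() mimics Mapper.setup() (load the
small side once, reuse it for every record), and the Gaussian-kernel
weighting is one plausible LWLR choice. All names and the CSV format are
invented for the illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Map-side-join pattern: a small side file of query (prediction input)
// vectors is loaded once per mapper, then every training record is
// weighted against it.
public class SideFileSketch {

    private double[][] queries;  // loaded once, reused for every record

    // Analogous to Mapper.setup(): read the cached side file a single time.
    void setup(Path sideFile) throws IOException {
        List<String> lines = Files.readAllLines(sideFile);
        queries = new double[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] parts = lines.get(i).split(",");
            queries[i] = new double[parts.length];
            for (int j = 0; j < parts.length; j++) {
                queries[i][j] = Double.parseDouble(parts[j]);
            }
        }
    }

    // Analogous to map(): weight one training input against one query
    // using a Gaussian kernel, the usual LWLR weighting choice.
    double weightAgainst(int queryIndex, double[] trainingX, double bandwidth) {
        double[] q = queries[queryIndex];
        double d2 = 0;
        for (int j = 0; j < q.length; j++) {
            double diff = q[j] - trainingX[j];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * bandwidth * bandwidth));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("queries", ".csv");
        Files.write(tmp, List.of("1.0,2.0", "0.0,0.0"));
        SideFileSketch mapper = new SideFileSketch();
        mapper.setup(tmp);
        // A training point identical to query 0 gets the maximal weight.
        System.out.println(mapper.weightAgainst(0, new double[]{1.0, 2.0}, 1.0));
        // prints 1.0
    }
}
```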

Also, I have written an implementation of LSMR for iterative linear
solution.  Would that be helpful for you?  I think you may have mentioned
that you were looking at LSQR some time ago; LSMR is a follow-on
algorithm for that.  If you are interested, take a look at
https://issues.apache.org/jira/browse/MAHOUT-499
and the git repo referenced there.  As soon as the 0.4 release goes out,
I will likely commit it in case you need it.
