Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2010/07/16 00:15:06 UTC

Complex equation over map/reduce

Hi all,

Thank you very much for your help thus far. I've got a question now that
seems like a doozie; maybe it's not.

I'm trying to figure out how to implement a very complex equation over
map/reduce. I won't reproduce the equation here (will post a picture of it
somewhere and hand out a link if anyone is interested), but the overall idea
is I'm computing a series of "S" values. Given a set of eigenvectors of
length n and their eigenvalues, as well as an arbitrary diagonal matrix with
dimensions n x n, I need to compute S. It goes something like this:

For each eigenvector u_k (and eigenvalue lambda_k):
  for each index i in u_k:
    for each index j in u_k:
      if i != j:
        compute S(lambda_k, u_k[i], u_k[j])

Obviously this will create a ton of output (n*(n-1) values for each eigenvector,
enough for one DistributedRowMatrix per eigenvector, but that's another
problem), but my question is: how do I feed the necessary inputs to the
map/reduce task?

Prior to beginning the task, I'll have a DRM of k eigenvectors of length n.
I'll also have a List of k eigenvalues, in addition to another DRM that is a
diagonal matrix. All of these need to reach the map/reduce job.

I would envision somehow splitting the task further, such that one map task
is simply a computation of a specific i and j (eliminating the loops in the
above pseudocode); that way all the values are scalars and can be stored in
the Configuration object for each map task. Or something like that.

(I believe this is along the lines of the discussion Sisir and I had a few
weeks ago, about the possibility of recursive map/reduce jobs :P )

Beyond that, perhaps storing the RowPath values for the eigenvector and
diagonal DRMs in the Configuration object, then using the map key value to
retrieve the correct row and iterate across it. But since the eigenvalues
are stored as a List, that still leaves the question of how to access the
correct eigenvalue for the computation.

I know this is all probably clear as mud, so please don't hesitate to ask
for clarification. Any insight would be much appreciated. Thank you!

Regards,
Shannon Quinn

Re: Complex equation over map/reduce

Posted by Ted Dunning <te...@gmail.com>.
Keep in mind that it is often feasible to have the mappers read in a side
file at configuration time and retain this in memory during the life of the
map.  This is done all the time with map-side joins.  You can do that with
the diagonal matrix which will, of course, be quite small.
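The side-file pattern Ted describes usually amounts to loading the small matrix once in the mapper's setup() and holding it in memory for the life of the task. A sketch under assumed names (the "diag.path" key, the DiagScaleMapper class, and the one-vector file layout are all hypothetical, not from the thread):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Sketch: read a small side file once per map task, at setup() time, and
// keep it in memory for the life of the mapper (the map-side-join pattern).
public class DiagScaleMapper
    extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

  private Vector diag; // in-memory copy of the small diagonal matrix

  @Override
  protected void setup(Context ctx) throws IOException {
    Configuration conf = ctx.getConfiguration();
    Path path = new Path(conf.get("diag.path")); // set by the driver
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    try {
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();
      if (reader.next(key, value)) { // assume the side file holds one vector
        diag = value.get();
      }
    } finally {
      reader.close();
    }
  }

  @Override
  protected void map(IntWritable row, VectorWritable vw, Context ctx)
      throws IOException, InterruptedException {
    // diag is now available for every input row without any further I/O
    ctx.write(row, vw);
  }
}
```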


Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
Hi Ted,

I put together the entire equation:

http://spectrallyclustered.files.wordpress.com/2010/07/sensitivityequation.png

(b is a constant throughout all the calculations, which can be trivially set
in the Configuration object. lambda is the eigenvalue corresponding to the
eigenvector u, u_i and u_j are components of the eigenvector, and d_i and
d_j are elements of the diagonal matrix: D_ii and D_jj, respectively)

Unfortunately there isn't any straightforward multiplication of terms
involved, but I do like your idea of using an index to fix one of the two
eigenvector components as the first operand (i, as you said), then treating
j as a variable through all the other elements.

The only problem with that approach is that I also need access to the
diagonal matrix; when varying j, I'll essentially need access to the whole
matrix in every mapper.

The hack I came up with was to concatenate the two matrices (of eigenvectors
and the diagonal one) side by side. Since they'll both have the same number
of columns, I can set a value in the Configuration object so the boundary
between the two is known to the mappers. But it's a dirty hack, and I wanted
to see if there was a more reasonable approach.

Thanks again!

Shannon


Re: Complex equation over map/reduce

Posted by Ted Dunning <te...@gmail.com>.
The simplest implementation would be to have each mapper get a few
eigenvectors and compute the lambda_k u_k u_k' product.

You have a choice at this point whether to emit individual values or to emit
entire vectors.  I would tend to emit entire vectors with i as key and
lambda_k u_k[i] u_k as the vector value.  The combiner and reducer would add
these rows together.  The result will be one row per file.  The combiner
will be absolutely crucial for performance.

But this actually looks like it is a large dense product.

Do you mean (need) to compute the entire product?
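The scheme described above (key each emitted row by i, with value lambda_k * u_k[i] * u_k, and let the combiner/reducer sum the rows per key) computes the matrix sum over k of lambda_k u_k u_k'. A dependency-free sketch of that arithmetic, with plain arrays standing in for the Writables (class and method names are illustrative):

```java
// Plain-Java sketch of the row-at-a-time computation: each "map" emits,
// for eigenvector u_k and row index i, the vector lambda_k * u_k[i] * u_k
// keyed by i; the "reduce" sums the vectors arriving under each key.
// The end result is the matrix sum_k lambda_k * u_k * u_k^T, one row per key.
public class OuterProductSketch {

  static double[][] weightedOuterProductSum(double[] lambda, double[][] u) {
    int n = u[0].length;
    double[][] result = new double[n][n];      // the reducer's per-row sums
    for (int k = 0; k < lambda.length; k++) {  // one mapper's share: u_k
      for (int i = 0; i < n; i++) {
        double scale = lambda[k] * u[k][i];    // lambda_k * u_k[i]
        for (int j = 0; j < n; j++) {
          result[i][j] += scale * u[k][j];     // reducer adds row vectors
        }
      }
    }
    return result;
  }
}
```

For example, with lambda = {2, -1} and u_0 = (1, 0), u_1 = (0.6, 0.8), entry (0,0) is 2*1*1 + (-1)*0.6*0.6 = 1.64.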


Re: Complex equation over map/reduce

Posted by Ted Dunning <te...@gmail.com>.
Both of these methods (cache and configured path) are "side-channels" in my
book.

You may need to test to see if there is a performance difference.  It is
conceivable that one is faster than the other due to replication factors or
other differences.

On Sun, Jul 18, 2010 at 7:51 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> I would load the list of eigenvalues into the *Hadoop cache*, pass the
> matrix of eigenvectors to the task as the m/r in-value, and *set the Path*
> to the diagonal matrix (it's a separate entity from the eigenvalues) in the
> Configuration object, so it can be accessed in the Mapper. Though it could
> also easily be passed via the *Hadoop cache*, like the list of eigenvalues.

Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
Just as a sanity check to make sure I'm on the right track - if I want to
generate this diagonal "matrix" (a vector) from an actual matrix (by summing
rows), should I use an output key that's constant (say, 0) and a
VectorWritable that I continuously read in from its SequenceFile in every
map(), write the new element to, and write it back to the context? Or is
there some sort of VectorWritable.Element I can use?

Thanks!
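One way to picture the constant-key approach described above: each map() emits a single (rowIndex, rowSum) contribution under the same key, and the reducer assembles all of them into the final vector, so nothing has to be read back and rewritten inside map(). A dependency-free sketch with plain arrays in place of the Writables (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the "diagonal from row sums" job: every map() emits one
// (rowIndex, rowSum) pair under a single constant key; the reducer then
// assembles the pairs into the diagonal vector. This avoids repeatedly
// reading and rewriting a shared vector from a SequenceFile inside map().
public class RowSumSketch {

  // "map" phase: one (rowIndex, rowSum) contribution per input row
  static List<double[]> mapPhase(double[][] matrix) {
    List<double[]> emitted = new ArrayList<>();
    for (int row = 0; row < matrix.length; row++) {
      double sum = 0.0;
      for (double v : matrix[row]) {
        sum += v;
      }
      emitted.add(new double[] {row, sum}); // all share the constant key 0
    }
    return emitted;
  }

  // "reduce" phase: all contributions arrive under the same key
  static double[] reducePhase(List<double[]> emitted, int n) {
    double[] diag = new double[n];
    for (double[] e : emitted) {
      diag[(int) e[0]] = e[1];
    }
    return diag;
  }
}
```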


Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
I went with an ArrayWritable and wrote my own Writable type, but NamedVector
looks like it could be useful to me later. Thanks so much!
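A record of the kind discussed in this thread, two ints (i, j) plus a double (the sensitivity), serializes to a fixed 16 bytes. A minimal sketch of the serialization logic; in a real job the class would additionally declare `implements org.apache.hadoop.io.Writable` (only java.io is used here so the sketch stands alone, and the class name is hypothetical):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Minimal (i, j, sensitivity) record with Writable-style serialization.
// An actual job would have it implement org.apache.hadoop.io.Writable.
public class SensitivityEntry {
  private int i;
  private int j;
  private double value;

  public SensitivityEntry() {}          // no-arg ctor: needed to deserialize

  public SensitivityEntry(int i, int j, double value) {
    this.i = i;
    this.j = j;
    this.value = value;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(i);
    out.writeInt(j);
    out.writeDouble(value);
  }

  public void readFields(DataInput in) throws IOException {
    i = in.readInt();
    j = in.readInt();
    value = in.readDouble();
  }

  public int getI() { return i; }
  public int getJ() { return j; }
  public double getValue() { return value; }
}
```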


Re: Complex equation over map/reduce

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Have you looked at NamedVector? It allows you to associate a String with 
each Vector where you could encode (i,j). NamedVectors carry this 
annotation transparently through all existing Vector computations.

Alternatively, you could create a similar Writable class to store i and 
j more directly.
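A sketch of the NamedVector idea, with the (i, j) pair packed into the name string (the "i,j" encoding and class name here are illustrative, not from the thread):

```java
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.Vector;

// Sketch: carry an (i, j) coordinate through vector computations by
// encoding it in a NamedVector's name string.
public class NamedVectorSketch {
  public static void main(String[] args) {
    Vector sensitivities = new DenseVector(new double[] {0.1, 0.9, 0.4});
    NamedVector tagged = new NamedVector(sensitivities, "3,7"); // i=3, j=7

    // The name rides along with the vector; recover the coordinate later.
    String[] coords = tagged.getName().split(",");
    int i = Integer.parseInt(coords[0]);
    int j = Integer.parseInt(coords[1]);
  }
}
```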


Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
There isn't really any way to get around the need for n^3 computations (as I
need to compare all the values), but I could significantly reduce the amount
of data actually retained. However, in order to do so I'll need to make a
specific data type Writable from within the Mapper.

Each eigenvector of length n produces n x n sensitivities. I can get away
with writing only the sensitivities that pass inspection (non-maximal
suppression), but in order to do so I need to preserve the (i, j) coordinate
that a specific sensitivity belongs to. In essence, for an eigenvector of
length n, I will have a VectorWritable of output of *at most* length n (at
least, length 1), but those elements need to have their original (i, j)
coordinates associated with them somehow.

Mahout Vectors can only store doubles, and Lists aren't by themselves
Writable. Any suggestions for, at the end of a specific Map task, storing
some sort of list (doesn't need to preserve order) of objects consisting of
two ints (i, j values) and a double (the computed sensitivity)?

Thank you very much!

Shannon


Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
>
>
> Hmm... n^3 showing up anywhere at all (even in an intermediate step and
> filtered later on), pretty much kills all hope of this being scalable - is
> it entirely necessary?  (10K double values)^3 = 8 TB, (100k)^3 = 8 PB.  And
> 100k is a pretty small matrix, on the scale of things...
>
>
Agreed.


> If you really need to operate on a column of a DRM, do transpose() (which
> costs you a M/R pass), and then a second pass at the transpose of your
> original matrix, column by column.
>
>
That's what I had in mind, I was just wondering about how to have an
arbitrary row from the original matrix and an arbitrary column from the
transpose in the same place so as to run the non-maximal suppression.


> You don't really need to have a full "diagonal matrix" which lives as a DRM
> which has a Path on HDFS.  If you load up the Hadoop cache with the vector
> of eigenvalues, so that at Mapper configure() time you now can put this
> Vector into memory, then your M/R job can just use them directly - figure
> out what the action is that you're going to do, and do it: if you're doing
> D^{1/2} L D^{1/2}, then you're zipping over every entry L_{ij} and sending
> them to d_{ii}^{1/2} * L_{ij} * d_{jj}^{1/2}:
>
> //
>  private Vector diag; // initialized from configure
>
>  public void map(IntWritable rowNum, VectorWritable vw, Context ctx)
>      throws IOException, InterruptedException {
>    for (Vector.Element e : vw.get()) {
>      e.set(e.get() * Math.sqrt(diag.get(rowNum.get()))
>          * Math.sqrt(diag.get(e.index())));
>    }
>    ctx.write(rowNum, vw);
>  }
> //
>
>
This particular operation I can certainly rewrite in this fashion; at the
time I simply found it easier to do D.times(L.times(D)). But your suggestion
is certainly a good one for my current problem - I'd have two vectors side
by side in the cache: one representing the diagonal matrix, the other the
list of eigenvalues. That should work just fine.

Thanks!

Shannon

Re: Complex equation over map/reduce

Posted by Jake Mannix <ja...@gmail.com>.
On Mon, Jul 19, 2010 at 4:51 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> I fully agree; the n could get exceptionally large, and even on a
> distributed cluster I'm not wild about the idea of storing n^3 values.


Hmm... n^3 showing up anywhere at all (even in an intermediate step and
filtered later on), pretty much kills all hope of this being scalable - is
it entirely necessary?  (10K double values)^3 = 8 TB, (100k)^3 = 8 PB.  And
100k is a pretty small matrix, on the scale of things...
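For reference, the arithmetic behind those figures (n^3 doubles at 8 bytes
each, in decimal terabytes):

```java
// Back-of-the-envelope cost of materializing n^3 doubles (8 bytes each),
// reported in decimal terabytes; 1 PB = 1000 TB.
public class CubicStorage {
  public static double terabytes(long n) {
    return (double) n * n * n * 8.0 / 1e12;
  }
}
```

n = 10^4 gives 8 TB; n = 10^5 gives 8000 TB, i.e. 8 PB.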


> These values are pared down significantly in the next step of the algorithm
> (see my latest blog post for more details on this:
> http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/), however the next step presents another problem. In that step, I need to
> compare every value in a matrix to all the other values in the same row or
> column. This process will effectively sparsify each matrix to n values, or a
> single matrix worth of values.
>

I'll try to read your blog post in more detail in the next day or so (brain
not fully on right now - jetlag).


> But, since everything in DRM-world is a row, I'm not sure yet how to
> compare the columns, but I was going to look at times() and timesSquared()
> for some ideas.
>

If you really need to operate on a column of a DRM, do transpose() (which
costs you a M/R pass), and then a second pass at the transpose of your
original matrix, column by column.
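In miniature, the trick looks like this, with plain in-memory arrays standing
in for the DRM (transpose() does the same reshuffle as an M/R pass):

```java
// Column j of A becomes row j of A^T, so a row-oriented store can serve
// columns after a single transpose. Plain arrays stand in for the DRM here.
public class TransposeAccess {
  public static double[][] transpose(double[][] a) {
    int rows = a.length, cols = a[0].length;
    double[][] t = new double[cols][rows];
    for (int i = 0; i < rows; i++) {
      for (int j = 0; j < cols; j++) {
        t[j][i] = a[i][j];
      }
    }
    return t;
  }
}
```

After the transpose, "column j of the original" is just the row-at-a-time
access pattern every DRM job already uses.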


> The "matrix" of eigenvalues will indeed only be a vector, though I'm not
> entirely sure what you meant by "side-channel". From the looks of things,
> when this job is being configured, I would load the list of eigenvalues into
> the Hadoop cache, pass the matrix of eigenvectors to the task as the m/r
> in-value, and set the Path to the diagonal matrix (it's a separate entity
> from the eigenvalues) in the Configuration object, so it can be accessed in
> the Mapper. Though it could also easily be passed via the Hadoop cache, like
> the list of eigenvalues.
>

You don't really need to have a full "diagonal matrix" which lives as a DRM
which has a Path on HDFS.  If you load up the Hadoop cache with the vector
of eigenvalues, so that at Mapper configure() time you now can put this
Vector into memory, then your M/R job can just use them directly - figure
out what the action is that you're going to do, and do it: if you're doing
D^{1/2} L D^{1/2}, then you're zipping over every entry L_{ij} and sending
them to d_{ii}^{1/2} * L_{ij} * d_{jj}^{1/2}:

//
  private Vector diag; // initialized from configure

  public void map(IntWritable rowNum, VectorWritable vw, Context ctx)
      throws IOException, InterruptedException {
    for (Vector.Element e : vw.get()) {
      e.set(e.get() * Math.sqrt(diag.get(rowNum.get()))
          * Math.sqrt(diag.get(e.index())));
    }
    ctx.write(rowNum, vw);
  }
//
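As a standalone sanity check (plain arrays, no Hadoop), the entrywise rule is
exactly the explicit product D^{1/2} L D^{1/2}:

```java
// Entrywise scaling sqrt(d_ii) * L_ij * sqrt(d_jj), the same rule the
// mapper above applies row by row; for symmetric diagonal scaling this
// equals the matrix product D^{1/2} L D^{1/2}.
public class DiagScaleCheck {
  public static double[][] scale(double[][] l, double[] d) {
    int n = l.length;
    double[][] out = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        out[i][j] = Math.sqrt(d[i]) * l[i][j] * Math.sqrt(d[j]);
      }
    }
    return out;
  }
}
```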

  -jake

Re: Complex equation over map/reduce

Posted by Shannon Quinn <sq...@gatech.edu>.
I fully agree; the n could get exceptionally large, and even on a
distributed cluster I'm not wild about the idea of storing n^3 values.
These values are pared down significantly in the next step of the
algorithm (see my latest blog post for more details:
http://spectrallyclustered.wordpress.com/2010/07/15/sprint-3-three-last-mr-tasks/ );
however, the next step presents another problem. In that step, I need to
compare every value in a matrix to all the other values in the same row
or column. This process will effectively sparsify each matrix to n
values, or a single matrix's worth of values.

But, since everything in DRM-world is a row, I'm not sure yet how to
compare the columns; I was going to look at times() and timesSquared()
for some ideas.

The "matrix" of eigenvalues will indeed only be a vector, though I'm not
entirely sure what you meant by "side-channel". From the looks of
things, when this job is being configured, I would load the list of
eigenvalues into the Hadoop cache, pass the matrix of eigenvectors to
the task as the m/r in-value, and set the Path to the diagonal matrix
(it's a separate entity from the eigenvalues) in the Configuration
object, so it can be accessed in the Mapper. Though it could also easily
be passed via the Hadoop cache, like the list of eigenvalues.

I hope this is clear. Please let me know if I should explain further.
Thank you all again for your help.

Shannon

Apologies for the brevity, this was sent from my iPhone

On Jul 18, 2010, at 1:44, Jake Mannix <ja...@gmail.com> wrote:

> Hey Shannon,
> 
>  One of the most important things that Ted mentions is the "load data
> in the configure() method via a side-channel".  In this case, yes, the
> eigenvalues should be loaded at mapper initialization.  As I mentioned
> in a separate thread, the diagonal matrix of eigenvalues should not
> live as a DRM, as it is tiny.  It can live in a single vector, which is
> then put in the distributed cache of Hadoop (see the TimesSquaredJob
> class for details on how to do this).
> 
>  Once you've computed the n(n-1)/2 outputs from each vector, what
> do you do with it?   For graph problems, the "n" which governs the
> size of the eigenvectors is the number of vertices, and so n^2 is
> prohibitively large to store even once, let alone for k different
> eigenvectors.  Do you need to keep all of these values?
> 
>  -jake

Re: Complex equation over map/reduce

Posted by Jake Mannix <ja...@gmail.com>.
Hey Shannon,

  One of the most important things that Ted mentions is the "load data
in the configure() method via a side-channel".  In this case, yes, the
eigenvalues should be loaded at mapper initialization.  As I mentioned
in a separate thread, the diagonal matrix of eigenvalues should not
live as a DRM, as it is tiny.  It can live in a single vector, which is
then put in the distributed cache of Hadoop (see the TimesSquaredJob
class for details on how to do this).

  Once you've computed the n(n-1)/2 outputs from each vector, what
do you do with it?   For graph problems, the "n" which governs the
size of the eigenvectors is the number of vertices, and so n^2 is
prohibitively large to store even once, let alone for k different
eigenvectors.  Do you need to keep all of these values?

  -jake

On Fri, Jul 16, 2010 at 12:15 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> Hi all,
>
> Thank you very much for your help thus far. I've got a question now that
> seems like a doozie; maybe it's not.
>
> I'm trying to figure out how to implement a very complex equation over
> map/reduce. I won't reproduce the equation here (will post a picture of it
> somewhere and hand out a link if anyone is interested), but the overall
> idea
> is I'm computing a series of "S" values. Given a set of eigenvectors of
> length n and their eigenvalues, as well as an arbitrary diagonal matrix
> with
> dimensions n x n, I need to compute S. It goes something like this:
>
> For each eigenvector u_k (and eigenvalue lambda_k):
>  for each index i in u_k:
>    for each index j in u_k:
>      if i != j:
>        compute S(lambda_k, u_k[i], u_k[j])
>
> Obviously this will create a ton of output ( n*(n-1) for each eigenvector,
> enough for one DistributedRowMatrix per eigenvector, but that's another
> problem), but my question is: how do I feed the necessary inputs to the
> map/reduce task?
>
> Prior to beginning the task, I'll have a DRM of k eigenvectors of length n.
> I'll also have a List of k eigenvalues, in addition to another DRM that is
> a
> diagonal matrix. All of these need to reach the map/reduce job.
>
> I would envision somehow splitting the task further, such that one map task
> is simply a computation of a specific i and j (eliminating the loops in the
> above pseudocode), that way all the values are scalars and can be stored in
> the Configuration object for each map task. Or something like that.
>
> (I believe this is along the lines of the discussion Sisir and I had a few
> weeks ago, about the possibility of recursive map/reduce jobs :P )
>
> Beyond that, perhaps storing the RowPath values for the eigenvector and
> diagonal DRMs in the Configuration object, then using the map key value to
> retrieve the correct row and iterate across it. But since the eigenvalues
> are stored as a List, that still leaves the question of how to access the
> correct eigenvalue for the computation.
>
> I know this is all probably clear as mud, so please don't hesitate to ask
> for clarification. Any insight would be much appreciated. Thank you!
>
> Regards,
> Shannon Quinn
>