Posted to dev@mahout.apache.org by "Shannon Quinn (JIRA)" <ji...@apache.org> on 2010/12/29 07:06:46 UTC

[jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

     [ https://issues.apache.org/jira/browse/MAHOUT-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shannon Quinn updated MAHOUT-537:
---------------------------------

    Attachment: MAHOUT-537.patch

Updated patch. Fixes from previous patch are included, this time merged with unrelated changes to the related files. Also removed all the commented-out old code, and even caught and fixed a few bugs. Fully implemented timesSquared(). All that remains is the times(DRM) job. Will update on this very soon.

(regarding the previous comments on this ticket: I'm using Hadoop 0.20.2)

> Bring DistributedRowMatrix into compliance with Hadoop 0.20.2
> -------------------------------------------------------------
>
>                 Key: MAHOUT-537
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-537
>             Project: Mahout
>          Issue Type: Improvement
>    Affects Versions: 0.4
>            Reporter: Shannon Quinn
>            Assignee: Shannon Quinn
>         Attachments: MAHOUT-537.patch, MAHOUT-537.patch
>
>
> Convert the current DistributedRowMatrix to use the newer Hadoop 0.20.2 API, in particular eliminate dependence on the deprecated JobConf, using instead the separate Job and Configuration objects.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.

Ah, and the reason I never encountered this problem in my own code is
that I've been dealing exclusively with symmetric matrices. Thanks very
much for the primer, I'll need it in my continuing work!

On 1/6/11 9:11 PM, Jake Mannix wrote:
> Hey Shannon,
>
>    I'm replying via phone, so apologies in advance for brevity:
>
>    If you have a DRM (A) which is n rows by m columns, and another DRM (B)
> which is m rows by p columns, there is *no single method* on DRM which
> computes A*B (a sensible matrix with n rows by p columns).  To compute this,
> you would run A.transpose().times(B).
>
>    On the other hand, if you already have a matrix (call it At) with m rows
> by n columns, then At.times(B) will compute a matrix with n rows and p
> columns in one method call (and one MR pass) whose entries are exactly the
> same as taking the true matrix multiplication of the transpose of At times
> B.
>
>    Any time you use DRM.times(), both DRM instances must have the same
> number of rows (*not* the number of columns of the first equal to the
> number of rows of the second).  In fact, as Dmitriy points out, they have
> to have the same number of InputSplits as well (which is easily achieved by
> having both be created in MR jobs with the same # of reducers).
>
>    -jake
>
> On Jan 6, 2011 1:53 PM, "Shannon Quinn"<sq...@gatech.edu>  wrote:
>
>>    Matrix A has N rows (each of which has cardinality M_A), and Matrix B
>> has N rows (each of whi...
> I suppose this is where I get confused. I thought, by definition, matrix A
> has dimensions (n by m), and matrix B has dimensions (m by p), and the
> resulting matrix is (n by p). I saw in the implementation that it cleverly
> uses the transpose of A such that just the row vectors are needed, but my
> confusion comes from the fact that I don't see an explicit transpose before
> the times() job gets going.
>
> So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if the
> three rows of A are being sent to the MR job with the two rows of B, which
> doesn't make any sense. I know there should be a transpose of A somewhere
> but I don't see it.
>
> Unless the assumption is that the user calls transpose() before calling
> times()? Which doesn't make any sense either since I've used this job just
> fine. I know I'm missing something simple...thanks for your help.
>
> Also: I'll shelve the general DRM rewrite patch, then, for the time being.
> You make good points, and there are other patches I should work on in the
> meantime :) (though I could just experiment with 0.21 to see how well that
> works)
>
> Shannon
>
>>    There are thus N pairs of vectors {A_i, B_i}, and if you take
>> MatrixSum_{i=1,N} (A_i^T x B_i...
>

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.

Hey Shannon,

  I'm replying via phone, so apologies in advance for brevity:

  If you have a DRM (A) which is n rows by m columns, and another DRM (B)
which is m rows by p columns, there is *no single method* on DRM which
computes A*B (a sensible matrix with n rows by p columns).  To compute this,
you would run A.transpose().times(B).

  On the other hand, if you already have a matrix (call it At) with m rows
by n columns, then At.times(B) will compute a matrix with n rows and p
columns in one method call (and one MR pass) whose entries are exactly the
same as taking the true matrix multiplication of the transpose of At times
B.

  Any time you use DRM.times(), both DRM instances must have the same
number of rows (*not* the number of columns of the first equal to the
number of rows of the second).  In fact, as Dmitriy points out, they have
to have the same number of InputSplits as well (which is easily achieved by
having both be created in MR jobs with the same # of reducers).
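
To make the row-count rule concrete, here is a minimal in-memory sketch using plain Java arrays rather than Mahout's DistributedRowMatrix. The class TimesSemantics and its methods are hypothetical stand-ins that only mimic the semantics described above (times() yielding the transpose of the first matrix times the second); the IllegalArgumentException is an assumption standing in for whatever check the real class performs. It uses the toy shapes from the question, A = [3 by 2] and B = [2 by 2].

```java
import java.util.Arrays;

final class TimesSemantics {

    // Plain in-memory transpose; a stand-in for DRM.transpose().
    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // Mimics the described semantics: returns x^T * y.
    // Both inputs must have the same number of rows (assumed non-empty).
    static double[][] times(double[][] x, double[][] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException(
                "row counts differ: " + x.length + " vs " + y.length);
        }
        double[][] out = new double[x[0].length][y[0].length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < x[0].length; j++) {
                for (int k = 0; k < y[0].length; k++) {
                    out[j][k] += x[i][j] * y[i][k];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}}; // 3 by 2
        double[][] b = {{1, 2}, {3, 4}};         // 2 by 2

        // Row counts 3 and 2 do not match, so this call is rejected.
        try {
            times(a, b);
        } catch (IllegalArgumentException e) {
            System.out.println("times(A, B) rejected: " + e.getMessage());
        }

        // Transposing first gives two 2-row inputs; the result is the 3 by 2 product A*B.
        System.out.println(Arrays.deepToString(times(transpose(a), b)));
        // prints [[7.0, 10.0], [15.0, 22.0], [23.0, 34.0]]
    }
}
```

The second call corresponds to A.transpose().times(B): the explicit transpose and the implicit one inside times() cancel, leaving the ordinary product A*B.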

  -jake

On Jan 6, 2011 1:53 PM, "Shannon Quinn" <sq...@gatech.edu> wrote:

> >   Matrix A has N rows (each of which has cardinality M_A), and Matrix B
> > has N rows (each of whi...
> I suppose this is where I get confused. I thought, by definition, matrix A
> has dimensions (n by m), and matrix B has dimensions (m by p), and the
> resulting matrix is (n by p). I saw in the implementation that it cleverly
> uses the transpose of A such that just the row vectors are needed, but my
> confusion comes from the fact that I don't see an explicit transpose before
> the times() job gets going.
>
> So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if the
> three rows of A are being sent to the MR job with the two rows of B, which
> doesn't make any sense. I know there should be a transpose of A somewhere
> but I don't see it.
>
> Unless the assumption is that the user calls transpose() before calling
> times()? Which doesn't make any sense either since I've used this job just
> fine. I know I'm missing something simple...thanks for your help.
>
> Also: I'll shelve the general DRM rewrite patch, then, for the time being.
> You make good points, and there are other patches I should work on in the
> meantime :) (though I could just experiment with 0.21 to see how well that
> works)
>
> Shannon
>
> >   There are thus N pairs of vectors {A_i, B_i}, and if you take
> > MatrixSum_{i=1,N} (A_i^T x B_i...

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.

>    Matrix A has N rows (each of which has cardinality M_A), and Matrix B has
> N rows (each of which has cardinality M_B).
I suppose this is where I get confused. I thought, by definition, matrix 
A has dimensions (n by m), and matrix B has dimensions (m by p), and the 
resulting matrix is (n by p). I saw in the implementation that it 
cleverly uses the transpose of A such that just the row vectors are 
needed, but my confusion comes from the fact that I don't see an 
explicit transpose before the times() job gets going.

So, in a toy example, A = [3 by 2], B = [2 by 2], it looks to me as if 
the three rows of A are being sent to the MR job with the two rows of B, 
which doesn't make any sense. I know there should be a transpose of A 
somewhere but I don't see it.

Unless the assumption is that the user calls transpose() before calling 
times()? Which doesn't make any sense either since I've used this job 
just fine. I know I'm missing something simple...thanks for your help.

Also: I'll shelve the general DRM rewrite patch, then, for the time 
being. You make good points, and there are other patches I should work 
on in the meantime :) (though I could just experiment with 0.21 to see 
how well that works)

Shannon

>    There are thus N pairs of
> vectors {A_i, B_i}, and if you take MatrixSum_{i=1,N} (A_i^T x B_i), you get
> a matrix with M_A rows, each of which has cardinality M_B, and this matrix
> is exactly A^T * B.
>
> *You take the transpose on the vectors, row at a time*, from the first of
> the two matrices.
>
>    -jake
>
>
>> I want to understand this little bit so I adequately replicate it in the
>> new patch. Thanks!
>>
>> Shannon
>>
>> Apologies for the brevity, this was sent from my iPhone
>>

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Jake Mannix <ja...@gmail.com>.

Hi Shannon, sorry to have been absent too much in this thread!

On Thu, Dec 30, 2010 at 2:16 PM, Shannon Quinn <sq...@gatech.edu> wrote:

> I'm just about finished with this patch (though I'm road tripping at the
> moment), but I wanted to seek some clarification on the mechanics behind
> DRM's matrix multiplication.
>
> I see upon closer inspection that what is actually used is the transpose of
> the multiplicand (matrix A^T in A*B), thereby using only matrix rows (how
> DRMs are organized across HDFS). However, I didn't see any explicit
> transpose operation within the times() method. How is this carried out?
>

The transpose operation is a side effect of the fact that a DRM just
consists of a list of vectors, and you could view it as a row-based matrix
or a column-based matrix.  The matrix multiplication works like so:

  Matrix A has N rows (each of which has cardinality M_A), and Matrix B has
N rows (each of which has cardinality M_B).  There are thus N pairs of
vectors {A_i, B_i}, and if you take MatrixSum_{i=1,N} (A_i^T x B_i), you get
a matrix with M_A rows, each of which has cardinality M_B, and this matrix
is exactly A^T * B.

*You take the transpose on the vectors, row at a time*, from the first of
the two matrices.
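
The sum of per-row outer products above can be sketched in memory as follows. This is only an illustration with plain Java arrays, not the MapReduce job in DistributedRowMatrix; the class name OuterProductSum and the method transposeTimes are hypothetical, and the inputs are assumed to be non-empty rectangular arrays.

```java
import java.util.Arrays;

final class OuterProductSum {

    // Accumulates sum over i of (A_i^T x B_i), where A_i and B_i are the
    // i-th rows of a and b. The result has M_A rows of cardinality M_B,
    // which is A^T * B.
    static double[][] transposeTimes(double[][] a, double[][] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(
                "row counts differ: " + a.length + " vs " + b.length);
        }
        int mA = a[0].length;
        int mB = b[0].length;
        double[][] result = new double[mA][mB];
        for (int i = 0; i < a.length; i++) {
            // Outer product of row i of a (taken as a column) with row i of b.
            for (int j = 0; j < mA; j++) {
                for (int k = 0; k < mB; k++) {
                    result[j][k] += a[i][j] * b[i][k];
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};          // N = 3, M_A = 2
        double[][] b = {{1, 0, 2}, {0, 1, 3}, {1, 1, 0}}; // N = 3, M_B = 3
        System.out.println(Arrays.deepToString(transposeTimes(a, b)));
        // prints [[6.0, 8.0, 11.0], [8.0, 10.0, 16.0]]
    }
}
```

Each pass of the outer loop touches only one row of each input, which is why no explicit transpose of the first matrix is ever materialized.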

  -jake


> I want to understand this little bit so I adequately replicate it in the
> new patch. Thanks!
>
> Shannon
>
> Apologies for the brevity, this was sent from my iPhone
>

Re: [jira] Updated: (MAHOUT-537) Bring DistributedRowMatrix into compliance with Hadoop 0.20.2

Posted by Shannon Quinn <sq...@gatech.edu>.

I'm just about finished with this patch (though I'm road tripping at the moment), but I wanted to seek some clarification on the mechanics behind DRM's matrix multiplication. 

I see upon closer inspection that what is actually used is the transpose of the multiplicand (matrix A^T in A*B), thereby using only matrix rows (how DRMs are organized across HDFS). However, I didn't see any explicit transpose operation within the times() method. How is this carried out?

I want to understand this little bit so I adequately replicate it in the new patch. Thanks!

Shannon

Apologies for the brevity, this was sent from my iPhone
