You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Jake Mannix (JIRA)" <ji...@apache.org> on 2010/01/19 18:57:54 UTC

[jira] Created: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Matrix interface should extend Iterable<Vector> for better integration with distributed storage
-----------------------------------------------------------------------------------------------

                 Key: MAHOUT-263
                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
             Project: Mahout
          Issue Type: Improvement
          Components: Math
    Affects Versions: 0.2
         Environment: all
            Reporter: Jake Mannix
            Assignee: Jake Mannix
             Fix For: 0.3


Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:

{code}
Matrix m = getInputCorpus();
for(int i=0; i<m.numRows(); i++) {
  Vector v = m.getRow(i);
  doStuffWithRow(v); 
}
{code}

When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:

{code}
public interface Matrix extends Iterable<Vector> { 
{code}

which allows for algorithms which only need iterators over Vectors do use them as such:

{code}
Matrix m = getInputCorpus();
Iterator<Vector> it = m.iterator();
Vector v;
while(it.hasNext() && (v = it.next()) != null) {
  doStuffWithRow(v); 
}
{code}

The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).

I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  

Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.

Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-263:
-------------------------------

    Attachment:     (was: MAHOUT-263.diff)

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix resolved MAHOUT-263.
--------------------------------

    Resolution: Fixed

fixed in r903965

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-263.diff
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802960#action_12802960 ] 

Ted Dunning commented on MAHOUT-263:
------------------------------------

The idea of iterating through inputs sequentially is absolutely key to good performance on sequential algorithms with good abstraction.

Some algorithms need to permute inputs to some degree, but that is easily handled to a moderate degree by buffering some number of rows and presenting them in randomized order.

{quote}
Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.

Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.
{quote}
My feeling is that we don't need to support iteration both ways.  It would be nice, but the performance hit is so prodigious that it just isn't worth it.  In the past, where I needed to support column and row access, generally stored two copies each optimized for a different kind of access.  This is very rarely needed since most algorithms are very row or column centric and the data can be transposed (and thus permuted to match the desired access pattern) ahead of time to accommodate the need.  In many cases, the loop nesting can be rearranged as well to allow sequential row access to serve as well as column access, especially if map-reduce can be used to rearrange intermediate products.
  

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-263:
-------------------------------

    Attachment: MAHOUT-263.diff

Ugly ugly names.  Better suggestions?

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-263) Matrix interface should extend Iterable for better integration with distributed storage

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Mannix updated MAHOUT-263:
-------------------------------

    Attachment: MAHOUT-263.diff

Let's try a patch which is svn up'ed first.  

Suggest better names for these methods / interfaces / classes?

> Matrix interface should extend Iterable<Vector> for better integration with distributed storage
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-263
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-263
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.2
>         Environment: all
>            Reporter: Jake Mannix
>            Assignee: Jake Mannix
>             Fix For: 0.3
>
>         Attachments: MAHOUT-263.diff
>
>
> Many sparse algorithms for dealing with Matrices just make sequential passes over the data, but don't need to see the entire matrix at once.  The way they would be implemented currently is:
> {code}
> Matrix m = getInputCorpus();
> for(int i=0; i<m.numRows(); i++) {
>   Vector v = m.getRow(i);
>   doStuffWithRow(v); 
> }
> {code}
> When the Matrix is backed essentially by a SequenceFile<Integer, Vector>, this algorithm outline doesn't make sense, because it requires lots of sequential random access reads.  What makes more sense, and works for in-memory matrices too, is something like the following:
> {code}
> public interface Matrix extends Iterable<Vector> { 
> {code}
> which allows for algorithms which only need iterators over Vectors do use them as such:
> {code}
> Matrix m = getInputCorpus();
> Iterator<Vector> it = m.iterator();
> Vector v;
> while(it.hasNext() && (v = it.next()) != null) {
>   doStuffWithRow(v); 
> }
> {code}
> The Iterator interface could be easily implemented in the AbstractMatrix base class, so implementing this idea would be transparent to all current Mahout code.  Additionally, pulling out two layers of AbstractMatrix - one which only knows how to do the things which can be done using iterators (like times(Vector), timesSquared(Vector), plus(Matrix), assignRow(), etc...), which would be the direct base class for DistributedMatrix (or HDFSMatrix), but all the random-access matrix methods currently in AbstractMatrix would go in another abstract base class of the first one (which could be called AbstractVectorIterable, say).
> I think Iteratable<Vector> could be made more flexible by extending that to a new interface VectorIterable, which provided iterateAll() and iterateNonEmpty(), in case document Ids were sparse, and could also allow for the possibility of adding other methods (things like skipTo(int rowNum), perhaps).  
> Question is: should this go for all Matrices, or just SparseRowMatrix?  It's really tricky to have a matrix which is iterable both as sparse rows *and* sparse columns.  I guess the point would be that by default, it iterates over rows, unless it's SparseColumnMatrix, which obviously iterates over columns.
> Thoughts?  Having to rely on random-access to a distributed-backed matrix is making me jump through silly extra hoops on some of the stuff I'm working on patches for.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.