Posted to commits@mahout.apache.org by co...@apache.org on 2008/03/18 01:57:01 UTC

[CONF] Apache Lucene Mahout: Matrix and Vector Needs (page edited)

Matrix and Vector Needs (MAHOUT) edited by Jeff Eastman
      Page: http://cwiki.apache.org/confluence/display/MAHOUT/Matrix+and+Vector+Needs
   Changes: http://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=75990&originalVersion=5&revisedVersion=6






Content:
---------------------------------------------------------------------

h1. Intro

Most ML algorithms need to represent multidimensional data concisely and to perform common operations on that data easily. MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality, along with a set of common operations on their instances. Both come in sparse and dense implementations that are memory resident and suitable for manipulating intermediate results within mapper, combiner and reducer implementations. They are not intended for applications whose vectors or matrices exceed the memory of a single JVM, though such applications might be able to use them within a larger organizing framework.

h2. Background

See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser]

h2. Vectors

Mahout supports a vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[], which is efficient in both storage and access. The class SparseVector implements vectors as a HashMap<Integer, Double>, which is surprisingly fast and efficient. For sparse vectors, the size() method returns the number of elements currently stored, whereas the cardinality() method returns the number of dimensions of the vector. An additional VectorView class allows views of an underlying vector to be specified by the viewPart() method. See the JavaDocs for more complete definitions.
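
The difference between size() and cardinality() for a HashMap-backed sparse vector can be illustrated with a minimal sketch. The class name SparseVectorSketch and its method bodies are assumptions made for illustration; this is not the Mahout SparseVector source, and the choice to drop explicit zeros on set() is a design assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not Mahout's SparseVector. */
final class SparseVectorSketch {
    private final int cardinality;                                // number of dimensions
    private final Map<Integer, Double> values = new HashMap<>();  // non-zero cells only

    SparseVectorSketch(int cardinality) {
        this.cardinality = cardinality;
    }

    /** Number of dimensions, whether or not a value is stored for each. */
    int cardinality() {
        return cardinality;
    }

    /** Number of elements currently stored. */
    int size() {
        return values.size();
    }

    double get(int index) {
        checkIndex(index);
        return values.getOrDefault(index, 0.0);
    }

    void set(int index, double value) {
        checkIndex(index);
        if (value == 0.0) {
            values.remove(index);   // assumption of this sketch: zeros are not stored
        } else {
            values.put(index, value);
        }
    }

    /** Dot product that visits only the stored cells of the smaller operand. */
    double dot(SparseVectorSketch other) {
        if (other.cardinality != cardinality) {
            throw new IllegalArgumentException("cardinality mismatch");
        }
        SparseVectorSketch small = size() <= other.size() ? this : other;
        SparseVectorSketch large = small == this ? other : this;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.values.entrySet()) {
            sum += e.getValue() * large.get(e.getKey());
        }
        return sum;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= cardinality) {
            throw new IndexOutOfBoundsException("index " + index);
        }
    }
}
```

In this sketch a vector created with cardinality 1000 and two calls to set() with non-zero values reports a size() of 2, which is why the map-backed form suits data that is mostly zeros.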

h2. Matrices

We will more than likely need all the basic Matrix operations, plus some more advanced ones:

* Addition, Subtraction, Multiplication, Transpose, Inverse, Scaling
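
As a rough illustration of three of the operations listed above (transpose, multiplication and scaling), here is a small sketch over plain double[][] arrays. The class MatrixOpsSketch and its method names are hypothetical and are not taken from the Mahout Matrix interface; the sketch assumes rectangular, non-empty inputs.

```java
/** Hypothetical illustration only; assumes rectangular, non-empty arrays. */
final class MatrixOpsSketch {

    /** Returns a new matrix whose rows are the columns of m. */
    static double[][] transpose(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] t = new double[cols][rows];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    /** Returns the matrix product a * b. */
    static double[][] times(double[][] a, double[][] b) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        int rows = a.length;
        int inner = b.length;
        int cols = b[0].length;
        double[][] c = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int k = 0; k < inner; k++) {
                for (int j = 0; j < cols; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    /** Returns a new matrix with every cell multiplied by factor. */
    static double[][] scale(double[][] m, double factor) {
        double[][] s = new double[m.length][m[0].length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                s[i][j] = m[i][j] * factor;
            }
        }
        return s;
    }
}
```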


For algorithms like PageRank/TextRank, iterative approaches that essentially calculate eigenvectors are also useful.
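
One common iterative approach of this kind is power iteration: multiply a start vector by the matrix repeatedly and renormalize until it stops changing. The sketch below is an illustration under stated assumptions, not Mahout code. The class PowerIterationSketch and its method names are hypothetical, and the method only converges as written when the matrix has a single dominant eigenvalue that is positive and the start vector is not orthogonal to its eigenvector.

```java
import java.util.Arrays;

/** Hypothetical illustration only; assumes a square matrix with one dominant positive eigenvalue. */
final class PowerIterationSketch {

    /** Approximates the dominant eigenvector, normalized to unit Euclidean length. */
    static double[] dominantEigenvector(double[][] a, int maxIterations, double tolerance) {
        int n = a.length;
        double[] v = new double[n];
        Arrays.fill(v, 1.0 / Math.sqrt(n));          // uniform unit-length start vector
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = times(a, v);
            double norm = Math.sqrt(dot(next, next));
            if (norm == 0.0) {
                return v;                            // a maps v to zero; nothing to normalize
            }
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                next[i] /= norm;
                change += Math.abs(next[i] - v[i]);
            }
            v = next;
            if (change < tolerance) {
                break;                               // successive iterates agree
            }
        }
        return v;
    }

    /** Rayleigh quotient: an eigenvalue estimate for the vector v. */
    static double eigenvalueEstimate(double[][] a, double[] v) {
        return dot(v, times(a, v)) / dot(v, v);
    }

    static double[] times(double[][] a, double[] v) {
        double[] result = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            result[i] = dot(a[i], v);
        }
        return result;
    }

    static double dot(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * y[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] a = {{2.0, 1.0}, {1.0, 3.0}};
        double[] v = dominantEigenvector(a, 1000, 1e-12);
        System.out.println(Arrays.toString(v) + " eigenvalue ~ " + eigenvalueEstimate(a, v));
    }
}
```

Each iteration needs only a matrix-vector product, a dot product and a scaling, so the vector and matrix operations discussed on this page are enough to build it.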

Similarly, for vectors, operations like dot and cross product will be useful.
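
For reference, the two products can be sketched over plain double[] arrays as follows. The class VectorProductsSketch is hypothetical. The cross product shown is the three-dimensional geometric one; this page does not say whether the cross operation on the Mahout Vector interface has this meaning, so treat that as an assumption of the sketch.

```java
/** Hypothetical illustration only; not the Mahout Vector interface. */
final class VectorProductsSketch {

    /** Sum of the element-wise products of two vectors of equal length. */
    static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Geometric cross product; defined here only for three-dimensional vectors. */
    static double[] cross(double[] a, double[] b) {
        if (a.length != 3 || b.length != 3) {
            throw new IllegalArgumentException("cross product needs 3-dimensional vectors");
        }
        return new double[] {
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]
        };
    }
}
```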

h2. Ideas

Use HBase (BigTable) in Hadoop to represent the Matrix.  Batching row/column operations can be useful.

See [MAHOUT-6|https://issues.apache.org/jira/browse/MAHOUT-6]
See [Hama|http://wiki.apache.org/hadoop/Hama]


h2. References

Have a look at older parallel computing libraries such as [ScaLAPACK|http://www.netlib.org/scalapack/] and others.

---------------------------------------------------------------------