Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2011/02/09 12:45:57 UTC

[jira] Updated: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>

     [ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-322:
-----------------------------

    Assignee: Jake Mannix

Interesting one -- pinging this issue to see if anyone cares to move forward with a patch. I am not intimately familiar with the issue but have a few general thoughts.

However, I can see that the int is used in MatrixSlice, and that value is then used elsewhere in the framework. I don't know that it's easy to remove this assumption, and doing so may not be desirable, as the int looks like it is used as an offset into a vector. That doesn't mean there can't be some hashing scheme to map arbitrary values to ints, but that's non-trivial.
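
To make the mapping idea concrete, here is a minimal single-machine sketch. It is not Mahout code: the class RowKeyIndex and its methods are hypothetical names, and it uses an in-memory dictionary rather than a hash function, so indices are dense and collision-free. It assumes all keys pass through one JVM, which is exactly what a distributed job does not give you.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Mahout: assigns dense int row indices
// to arbitrary keys in order of first appearance.
final class RowKeyIndex<K> {
    private final Map<K, Integer> indexByKey = new HashMap<>();
    private final List<K> keyByIndex = new ArrayList<>();

    // Returns the index already assigned to key, or assigns the next free one.
    int indexOf(K key) {
        Integer existing = indexByKey.get(key);
        if (existing != null) {
            return existing;
        }
        int next = keyByIndex.size();
        indexByKey.put(key, next);
        keyByIndex.add(key);
        return next;
    }

    // Reverse lookup: recovers the original key for a row index.
    K keyOf(int index) {
        return keyByIndex.get(index);
    }

    int size() {
        return keyByIndex.size();
    }

    public static void main(String[] args) {
        RowKeyIndex<String> rows = new RowKeyIndex<>();
        System.out.println(rows.indexOf("doc-a")); // prints 0
        System.out.println(rows.indexOf("doc-b")); // prints 1
        System.out.println(rows.indexOf("doc-a")); // prints 0 again
        System.out.println(rows.keyOf(1));         // prints doc-b
    }
}
```

Across many mappers, a consistent assignment would probably need a shared dictionary or a separate preprocessing pass, which is part of why this is non-trivial.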

But yes, the docs could be adjusted to reflect reality.

Thoughts?

> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> The class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix states that the matrix lives in SequenceFile<WritableComparable, VectorWritable>. The implementation, however, assumes that a SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD package, mainly to perform PCA on a massive document corpus. Given such a corpus, it makes sense not to limit the user by forcing the document "key" to be an integer. Instead, users should be able to use Text keys (document name or id) or keys of any other arbitrary class. One may even argue that forcing a WritableComparable key is too limiting, and that a simple Writable key should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the SequenceFile key at all, so that user-specific classes (unknown to Mahout) could be used as opaque keys even when their libraries are not available at runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but the reader has methods to query just the value, avoiding key deserialization altogether.
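
The opaque-key idea in the description can be illustrated with a toy example. This is only a sketch under stated assumptions: it uses a made-up length-prefixed record layout, not the real SequenceFile format, and it does not call Hadoop or Mahout at all. The class OpaqueKeyRecords and its methods are hypothetical. The point is just that when the key length is known, a reader can step over the key bytes without ever needing the key's class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: a toy record layout, NOT the SequenceFile format
// and NOT Hadoop or Mahout code.
final class OpaqueKeyRecords {

    // Writes one record as [keyLength][keyBytes][valueCount][values...].
    static void writeRecord(DataOutputStream out, byte[] keyBytes, double[] row)
            throws IOException {
        out.writeInt(keyBytes.length);
        out.write(keyBytes);
        out.writeInt(row.length);
        for (double v : row) {
            out.writeDouble(v);
        }
    }

    // Reads every row, skipping each key as opaque bytes (never decoded).
    static List<double[]> readRowsSkippingKeys(DataInputStream in) throws IOException {
        List<double[]> rows = new ArrayList<>();
        while (true) {
            int keyLength;
            try {
                keyLength = in.readInt();
            } catch (EOFException endOfStream) {
                return rows; // clean end: no further records
            }
            int skipped = 0;
            while (skipped < keyLength) {
                int n = in.skipBytes(keyLength - skipped);
                if (n <= 0) {
                    throw new EOFException("truncated key");
                }
                skipped += n;
            }
            double[] row = new double[in.readInt()];
            for (int i = 0; i < row.length; i++) {
                row[i] = in.readDouble();
            }
            rows.add(row);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeRecord(out, "doc-a".getBytes(StandardCharsets.UTF_8), new double[] {1.0, 2.0});
        writeRecord(out, "doc-b".getBytes(StandardCharsets.UTF_8), new double[] {3.0});
        out.flush();

        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        List<double[]> rows = readRowsSkippingKeys(in);
        System.out.println(rows.size()); // prints 2
    }
}
```

Whether the same holds for a real SequenceFile depends on how its reader is initialized and which of its methods are used, so this should be checked against the Hadoop version in use.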
