Posted to dev@mahout.apache.org by "Dmitriy Lyubimov (JIRA)" <ji...@apache.org> on 2011/02/10 02:59:57 UTC

[jira] Issue Comment Edited: (MAHOUT-322) DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>

    [ https://issues.apache.org/jira/browse/MAHOUT-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992832#comment-12992832 ] 

Dmitriy Lyubimov edited comment on MAHOUT-322 at 2/10/11 1:59 AM:
------------------------------------------------------------------

I think that, technically, one can't have sequence files marked with abstract key or value types. I have experimented with this enough times without ever seeing a different outcome, though it is possible I am wrong.

If there is indeed such a thing as a sequence file saved with WritableComparable as its key, one wouldn't be able to read it, since the sequence file reader can't instantiate a key object from an abstract class or interface. (It is possible to say "I don't care what the key is" by using SequenceFile.Reader<WritableComparable>, but that is not the same thing.) So I assume the issue description really means the ability of algorithms to accept arbitrary keys, not the actual file having WritableComparable as the key class in its header (which might imply assorted concrete key types in the same file, and I don't think that is supported).
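
To illustrate the instantiation problem described above, here is a minimal pure-JDK sketch. It is an analogy, not Hadoop code: the class KeyInstantiation and its method are hypothetical names, and JDK types stand in for Writable implementations. It assumes the reader creates the key object reflectively from the class name stored in the file header, which is how the comment describes it.

```java
// Pure-JDK analogy (not Hadoop code): create an object from a class name,
// the way a reader would when the name comes out of a file header.
final class KeyInstantiation {

    // Returns true if a no-arg instance of the named class can be created.
    static boolean canInstantiate(String className) {
        try {
            Class<?> cls = Class.forName(className);
            cls.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Interfaces and abstract classes end up here: there is
            // nothing concrete to construct.
            return false;
        }
    }

    public static void main(String[] args) {
        // Concrete class: can be instantiated.
        System.out.println(canInstantiate("java.lang.StringBuilder"));
        // Interface: cannot be instantiated.
        System.out.println(canInstantiate("java.lang.Comparable"));
        // Abstract class: cannot be instantiated.
        System.out.println(canInstantiate("java.util.AbstractList"));
    }
}
```

A header naming an interface such as WritableComparable would fall into the failing case, which is why the header has to name a concrete key class.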

Anyway, I agree that most of the BLAS and not-so-BLAS stuff is (or should be) key-type agnostic. However, that is not the same as saying these algorithms should ignore the keys (I don't ignore them). The way I solved that problem in Stochastic SVD was exactly this: I don't care what type the row labels are. The labels just get copied over into the keys of the corresponding rows in the U output, and U is made sure to have the same key class name saved as the input A had. That's it. I tested it with both IntWritable keys and Text keys (which is what I think seq2sparse produces). In other words, this undercuts the case for adjusting the documentation: if it is adjusted as the issue requires, seq2sparse output is no longer a DRM, yet it can still be processed by most DRM-related algorithms such as SSVD.
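
The "copy the label through without caring about its type" idea can be sketched in plain Java generics. This is only an illustration, not the SSVD code: KeyAgnosticRows and scaleRows are hypothetical names, double[] stands in for a row vector, and scaling stands in for whatever the algorithm actually computes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: a row transform that is agnostic to the key type K
// but still carries each key over to the matching output row.
final class KeyAgnosticRows {

    // Scales every row vector; the key is never inspected, only copied.
    static <K> List<Map.Entry<K, double[]>> scaleRows(
            List<Map.Entry<K, double[]>> rows, double factor) {
        List<Map.Entry<K, double[]>> out = new ArrayList<>();
        for (Map.Entry<K, double[]> row : rows) {
            double[] v = row.getValue();
            double[] scaled = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                scaled[i] = v[i] * factor;
            }
            // Same key object, hence same key class, in the output.
            out.add(new SimpleEntry<>(row.getKey(), scaled));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, double[]>> rows = new ArrayList<>();
        rows.add(new SimpleEntry<>("doc-a", new double[] {1.0, 2.0}));
        for (Map.Entry<String, double[]> row : scaleRows(rows, 2.0)) {
            System.out.println(row.getKey() + " -> " + row.getValue()[0]);
        }
    }
}
```

The same method works unchanged for Integer keys or String keys, which mirrors the IntWritable and Text cases mentioned above.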

I think there will always be this paradigm: some algorithms care about the label and some don't. In that sense, it is nothing more than a convention. I don't support coercing DRM as a format into having any particular key type, but I see no reason why a given algorithm couldn't require something particular (unless it has no good reason to).



  
> DistributedRowMatrix should live in SequenceFile<Writable,VectorWritable> instead of SequenceFile<IntWritable,VectorWritable>
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-322
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-322
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.3
>            Reporter: Danny Leshem
>            Assignee: Jake Mannix
>            Priority: Minor
>
> Class documentation for org.apache.mahout.math.hadoop.DistributedRowMatrix states that the matrix lives in SequenceFile<WritableComparable, VectorWritable>. Implementation, however, assumes SequenceFile<IntWritable, VectorWritable> is passed.
> Currently, usage of this class inside Mahout is limited to Jake Mannix's SVD package, mainly to perform PCA on a massive document corpus. Given such a corpus, it makes sense not to limit the user by forcing the document "key" to be an integer. Instead, users should be able to use Text keys (document name or id) or keys of any other arbitrary class. One may even argue that requiring a WritableComparable key is too limiting, and that a simple Writable key should be assumed.
> In fact, it would be best if DistributedRowMatrix did not read the SequenceFile key at all, to allow user-specific classes (unknown to Mahout) to be used as opaque keys even when their libraries are not available at runtime. Currently DistributedRowMatrix calls "reader.next(i, v)"... but the reader has methods to query just the value, avoiding key deserialization altogether.
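
The idea of reading values without deserializing keys can be sketched with a toy length-prefixed record layout. This is not the SequenceFile on-disk format and does not use the Hadoop API: SkipKeyReader, encode and readValuesOnly are hypothetical names, and the layout (key length, key bytes, value length, value bytes) is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy layout, not the SequenceFile format: each record is
// [key length][key bytes][value length][value bytes].
final class SkipKeyReader {

    // Encodes {key, value} pairs into the toy layout.
    static byte[] encode(String[][] records) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (String[] record : records) {
                byte[] key = record[0].getBytes(StandardCharsets.UTF_8);
                byte[] value = record[1].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);
                out.write(key);
                out.writeInt(value.length);
                out.write(value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns only the values; key bytes are skipped, never decoded,
    // so the key's class does not need to be known here.
    static List<String> readValuesOnly(byte[] data) {
        List<String> values = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int keyLength = in.readInt();
                in.skipBytes(keyLength);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                values.add(new String(value, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] data = encode(new String[][] {{"doc-a", "v1"}, {"doc-b", "v2"}});
        System.out.println(readValuesOnly(data));
    }
}
```

Because the key length is stored in front of the key, the reader can step over the key without knowing its class. That is the property the issue description relies on when it asks for opaque keys.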
