Posted to dev@mahout.apache.org by "Raphael Cendrillon (Created) (JIRA)" <ji...@apache.org> on 2011/12/12 01:22:30 UTC

[jira] [Created] (MAHOUT-923) Row mean job for PCA

Row mean job for PCA
--------------------

                 Key: MAHOUT-923
                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
             Project: Mahout
          Issue Type: Improvement
          Components: Math
    Affects Versions: 0.6
            Reporter: Raphael Cendrillon
            Assignee: Raphael Cendrillon
             Fix For: Backlog
         Attachments: MAHOUT-923.patch

Add a map-reduce job for calculating the mean row (column-wise mean) of a DistributedRowMatrix, for use in PCA.
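For readers new to the task: the "mean row" (column-wise mean) of a matrix is the vector whose j-th entry averages column j over all rows. A minimal plain-Java illustration of the math (not the Mahout patch itself):

```java
public class RowMean {
    // Returns the column-wise mean: out[j] = average over rows of m[i][j].
    public static double[] rowMean(double[][] m) {
        double[] sum = new double[m[0].length];
        for (double[] row : m) {
            for (int j = 0; j < row.length; j++) {
                sum[j] += row[j];
            }
        }
        for (int j = 0; j < sum.length; j++) {
            sum[j] /= m.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] mean = rowMean(new double[][] {{1, 2}, {3, 4}});
        System.out.println(java.util.Arrays.toString(mean)); // [2.0, 3.0]
    }
}
```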

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
See my suggestion on Review Board (if I am using it correctly; I am still
not sure what to do about it :)

On Mon, Dec 12, 2011 at 12:28 AM, Raphael Cendrillon
<ce...@gmail.com> wrote:
> Thanks, Dmitriy. I think I understand more clearly now. Are you saying I should make a map-only job and then just use some post-processing to manually combine the map outputs?
>
> How many rows should I process per map job?

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Raphael Cendrillon <ce...@gmail.com>.
Thanks, Dmitriy. I think I understand more clearly now. Are you saying I should make a map-only job and then just use some post-processing to manually combine the map outputs?

How many rows should I process per map job?


Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
> A combiner is definitely the next step.

It is definitely not. Why do you need to sort???

> One question, is there already a writable for tuples of e.g. int and Vector, or should I just write one from scratch?

From scratch.

Or you can save n as the first element of the vector, why not; your
front-end code would know how to unpack that.
But if not that, then a custom writable. TupleWritable saves the class
with each value. That's exactly why Writables were invented instead of
Java serialization: you must not save the type with every value.
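Dmitriy's packing idea -- write n first, then the raw vector values, with no per-value type information -- can be sketched with plain java.io streams (illustrative only; the real patch would implement Hadoop's Writable interface, and the class and method names here are made up):

```java
import java.io.*;

public class IntVectorTuple {
    // Sketch of a (count, vector) tuple: only the raw fields are serialized,
    // no class name per value, unlike default Java serialization.
    public static byte[] write(int n, double[] vector) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(n);              // number of rows seen
        out.writeInt(vector.length);  // vector cardinality
        for (double v : vector) {
            out.writeDouble(v);       // raw values only, no type tags
        }
        out.close();
        return bytes.toByteArray();
    }

    public static double[] readVector(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readInt();                 // n (ignored in this sketch)
        double[] vector = new double[in.readInt()];
        for (int j = 0; j < vector.length; j++) {
            vector[j] = in.readDouble();
        }
        return vector;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = write(3, new double[] {1.5, 2.5});
        System.out.println(readVector(data)[1]); // 2.5
    }
}
```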

-d


On Sun, Dec 11, 2011 at 8:14 PM, Raphael Cendrillon (Commented) (JIRA)
<ji...@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167341#comment-13167341 ]
>
> Raphael Cendrillon commented on MAHOUT-923:
> -------------------------------------------
>
> Thanks, Lance. A combiner is definitely the next step. One question: is there already a writable for tuples of e.g. int and Vector, or should I just write one from scratch? I know there is TupleWritable, but from what I've read online it's better to avoid that unless you're doing a multiple-input join.
>
> Regarding the class for the output vector, are you saying that instead of inheriting the class from the rows of the DistributedRowMatrix you'd rather be able to specify this manually?

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Lance Norskog <go...@gmail.com>.
The person using this job knows the right vector class to use. It may be
that the job receives a lot of sparse vectors but the mean should become a
dense vector. Or a vector that writes to a database. Or something else. In
fact, I may just want to turn a vector from dense to sparse, and I could
achieve that with this job.




-- 
Lance Norskog
goksron@gmail.com

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Lance Norskog <go...@gmail.com>.
To use a combiner, TupleWritable should be fine. I have not used it.

But it will copy the entire vector, and you would want to minimize this.
If that is a big problem, you can do an ugly trick: store the
counter as the key value, but make a custom Writable that always
returns "this equals the other". Then all of your counters have the
same key and thus all vectors go to the same reducer.
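Lance's "ugly trick" can be sketched in plain Java (the class name is illustrative; a real Hadoop key would also implement WritableComparable):

```java
// A key whose equals()/hashCode() treat every instance as identical, so a
// hash partitioner routes all records to one reducer while the key still
// carries a counter as payload.
public class AllEqualKey {
    public final int counter; // payload; does not participate in equality

    public AllEqualKey(int counter) {
        this.counter = counter;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof AllEqualKey; // "this equals the other", always
    }

    @Override
    public int hashCode() {
        return 0; // constant hash => same partition for every key
    }

    public static void main(String[] args) {
        AllEqualKey a = new AllEqualKey(7);
        AllEqualKey b = new AllEqualKey(42);
        // Same partition under a hash partitioner, despite different counters:
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
    }
}
```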





-- 
Lance Norskog
goksron@gmail.com

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If it's coherent with the rest of the code there, I guess it is benign
to use it for this particular purpose. I can't think of a case where
we'd want to pull exactly one vector into an MR job.


On Mon, Dec 12, 2011 at 12:54 AM, Raphael Cendrillon
<ce...@gmail.com> wrote:
> You've convinced me that this is probably a bad idea. You never know when it might come back to bite us later.
>
> On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:
>
>> Oh, now I remember what the deal with NullWritable was.
>>
>> Yes, a sequence file would read it, as in:
>>
>>    Configuration conf = new Configuration();
>>    FileSystem fs = FileSystem.getLocal(new Configuration());
>>    Path testPath = new Path("name.seq");
>>
>>    IntWritable iw = new IntWritable();
>>    SequenceFile.Writer w =
>>      SequenceFile.createWriter(fs,
>>                                conf,
>>                                testPath,
>>                                NullWritable.class,
>>                                IntWritable.class);
>>    w.append(NullWritable.get(),iw);
>>    w.close();
>>
>>
>>    SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
>>    while ( r.next(NullWritable.get(),iw));
>>    r.close();
>>
>>
>> but SequenceFileInputFormat would not. I.e., it is OK if you read
>> it explicitly, but I don't think one can use such files as input for
>> other MR jobs.
>>
>> But since in this case there's no MR job to consume that output (and
>> there likely never will be), I guess it is OK to save NullWritable in
>> this case...
>>
>> -d
>>
>> On Mon, Dec 12, 2011 at 12:30 AM, jiraposter@reviews.apache.org
>> (Commented) (JIRA) <ji...@apache.org> wrote:
>>>
>>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ]
>>>
>>> jiraposter@reviews.apache.org commented on MAHOUT-923:
>>> ------------------------------------------------------
>>>
>>>
>>>
>>> bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>>> bq.  > Hm. I hope I did not misread the code or miss something.
>>> bq.  >
>>> bq.  > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.
>>> bq.  > 2 -- The mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary, even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressure of map-reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for the operation to apply correctly).
>>> bq.  > 3 -- I am not sure -- is NullWritable legit as a key? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by a private constructor.
>>> bq.
>>> bq.  Raphael Cendrillon wrote:
>>> bq.      Thanks, Dmitriy.
>>> bq.
>>> bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
>>> bq.
>>> bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper? I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed? Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
>>> bq.
>>> bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
>>> bq.
>>> bq.
>>>
>>> The NullWritable objection is withdrawn. Apparently I haven't looked into Hadoop for too long; amazingly, it seems to work now.
>>>
>>>
>>> 1 -- I don't think your statement about the # of reduce tasks is true.
>>>
>>> The job (or, rather, the user) sets the number of reduce tasks via a config property, and users generally follow the Hadoop recommendation to set it to ~95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to get something like 75 reducers on a 40-node cluster, and consequently 75 output files (unless users really read the details of your job and figure out you meant it to be just 1).
>>> Now, it is true that only one file will actually end up containing anything, and the rest of the task slots will just be occupied doing nothing.
>>>
>>> So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files (in reality, with the default hash partitioner and key 1, it will wind up in partition 0001 unless there's exactly one reducer, as I guess there was in your test).
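The partition remark can be checked against the formula Hadoop's default HashPartitioner is known to use, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; since IntWritable.hashCode() is the wrapped int, key 1 with 75 reducers lands in partition 1 rather than 0. A standalone sketch of that arithmetic (not Hadoop code):

```java
public class PartitionDemo {
    // Mirrors Hadoop's default HashPartitioner:
    //   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    // IntWritable.hashCode() is just the wrapped int.
    public static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor(1, 75)); // 1 -> output file part-00001
        System.out.println(partitionFor(1, 1));  // 0 -> only one partition
    }
}
```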
>>>
>>> 2 -- It is simple: when you send n rows to reducers, they are shuffled and sorted. Sending massive sets to reducers has two effects. First, even if they all group under the same key, they are still sorted at ~ n log(n/p) cost, where p is the number of partitions, assuming uniform distribution (which this is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to the reduce tasks, which is still ~n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually sorting in a distributed way but rather doing a simple single-threaded sort at the one reducer that happens to get all the input.
>>>
>>> So that scheme would allocate a lot of task slots that are never used, do a sort that is not needed, and do it in a single reducer thread over the entire input, which is not parallel at all.
>>>
>>> Instead, consider this: each map task keeps a state consisting of (sum(X), k). It keeps updating it (sum += x, k++) for every new row x. At the end of the cycle (in cleanup) it writes only one tuple (sum(X), k) as output. That reduces the complexity of the sort and the I/O from millions of elements to just the # of maps (which is perhaps a handful, and in reality rarely overshoots 500 mappers). That is at least about 4 orders of magnitude.
>>>
>>> Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i, where i indexes the tuples arriving at the reducer). Because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. A volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation.
>>>
>>> But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.
>>>
>>> Finally, you don't even need to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case, IMO.
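The map-side scheme described above -- each mapper accumulates (sum(X), k) and emits a single tuple in cleanup, and the lone reducer merges the handful of partials -- can be sketched outside Hadoop as follows (class and method names are illustrative, not from the patch):

```java
public class MeanAccumulator {
    // Per-mapper state: running column-wise sum and row count.
    private final double[] sum;
    private int rows;

    public MeanAccumulator(int cardinality) {
        this.sum = new double[cardinality];
    }

    // The map() side: sum += x, k++ for every row; nothing emitted per row.
    public void accumulate(double[] row) {
        for (int j = 0; j < row.length; j++) {
            sum[j] += row[j];
        }
        rows++;
    }

    // The reduce() side: merge another mapper's (sum, k) partial into this one.
    public void merge(MeanAccumulator other) {
        for (int j = 0; j < sum.length; j++) {
            sum[j] += other.sum[j];
        }
        rows += other.rows;
    }

    // Final mean, computed once all partials are combined.
    public double[] mean() {
        double[] m = new double[sum.length];
        for (int j = 0; j < sum.length; j++) {
            m[j] = sum[j] / rows;
        }
        return m;
    }

    public static void main(String[] args) {
        MeanAccumulator mapper1 = new MeanAccumulator(2);
        mapper1.accumulate(new double[] {1, 2});
        MeanAccumulator mapper2 = new MeanAccumulator(2);
        mapper2.accumulate(new double[] {3, 4});
        mapper1.merge(mapper2); // the "reducer" step
        System.out.println(java.util.Arrays.toString(mapper1.mean())); // [2.0, 3.0]
    }
}
```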
>>>
>>> 3 -- I guess any Writable is OK except NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either key or value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used for keys in shuffle and sort: there is a reflection-based Writable factory inside which tries to use the default constructor of the class to instantiate it, which is (or was) not available for NullWritable.
>>>
>>>
>>> - Dmitriy
>>>
>>>
>>> -----------------------------------------------------------
>>> This is an automatically generated e-mail. To reply, visit:
>>> https://reviews.apache.org/r/3147/#review3838
>>> -----------------------------------------------------------
>>>
>>>
>>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>>> bq.
>>> bq.  -----------------------------------------------------------
>>> bq.  This is an automatically generated e-mail. To reply, visit:
>>> bq.  https://reviews.apache.org/r/3147/
>>> bq.  -----------------------------------------------------------
>>> bq.
>>> bq.  (Updated 2011-12-12 00:30:24)
>>> bq.
>>> bq.
>>> bq.  Review request for mahout.
>>> bq.
>>> bq.
>>> bq.  Summary
>>> bq.  -------
>>> bq.
>>> bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class, IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
>>> bq.
>>> bq.
>>> bq.  This addresses bug MAHOUT-923.
>>> bq.      https://issues.apache.org/jira/browse/MAHOUT-923
>>> bq.
>>> bq.
>>> bq.  Diffs
>>> bq.  -----
>>> bq.
>>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>>> bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>>> bq.
>>> bq.  Diff: https://reviews.apache.org/r/3147/diff
>>> bq.
>>> bq.
>>> bq.  Testing
>>> bq.  -------
>>> bq.
>>> bq.  Junit test
>>> bq.
>>> bq.
>>> bq.  Thanks,
>>> bq.
>>> bq.  Raphael
>>> bq.
>>> bq.

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Raphael Cendrillon <ce...@gmail.com>.
You've convinced me that this is probably a bad idea. You never know when it might come back to bite us later.

On 12 Dec, 2011, at 12:50 AM, Dmitriy Lyubimov wrote:

> Oh now i remember what the deal with NullWritable was.
> 
> yes sequence file would read it as in
> 
>    Configuration conf = new Configuration();
>    FileSystem fs = FileSystem.getLocal(new Configuration());
>    Path testPath = new Path("name.seq");
> 
>    IntWritable iw = new IntWritable();
>    SequenceFile.Writer w =
>      SequenceFile.createWriter(fs,
>                                conf,
>                                testPath,
>                                NullWritable.class,
>                                IntWritable.class);
>    w.append(NullWritable.get(),iw);
>    w.close();
> 
> 
>    SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
>    while ( r.next(NullWritable.get(),iw));
>    r.close();
> 
> 
> but SequenceFileInputFileFormat would not. I.e. it is ok if you read
> it explicitly but I don't think one can use such files as an input for
> other MR jobs.
> 
> But since in this case there's no MR job to consume that output (and
> unlikely ever will be) i guess it is ok to save NullWritable in this
> case...
> 
> -d
> 
> On Mon, Dec 12, 2011 at 12:30 AM, jiraposter@reviews.apache.org
> (Commented) (JIRA) <ji...@apache.org> wrote:
>> 
>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ]
>> 
>> jiraposter@reviews.apache.org commented on MAHOUT-923:
>> ------------------------------------------------------
>> 
>> 
>> 
>> bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
>> bq.  > Hm. I hope i did not read the code or miss something.
>> bq.  >
>> bq.  > 1 -- i am not sure this will actually work as intended unless # of reducers is corced to 1, of which i see no mention in the code.
>> bq.  > 2 -- mappers do nothing, passing on all the row pressure to sort which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer an no combiners. IMO mean computation should be pushed up to mappers to avoid sort pressures of map reduce. Then reduction becomes largely symbolical(but you do need pass on the # of rows mapper has seen, to the reducer, in order for that operation to apply correctly).
>> bq.  > 3 -- i am not sure -- is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.
>> bq.
>> bq.  Raphael Cendrillon wrote:
>> bq.      Thanks Dmitry.
>> bq.
>> bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
>> bq.
>> bq.      Regarding 2, that makes alot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have less columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
>> bq.
>> bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrx. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
>> bq.
>> bq.
>> 
>> NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.
>> 
>> 
>> 1 -- I don't think your statement about # of reduce tasks is true.
>> 
>> The job (or, rather, user) sets the number of reduce tasks via config propery. All users will follow hadoop recommendation to set that to 95% of capacity they want to take. (usually the whole cluster). So in production environment you are virtually _guaranteed_ to have number of reducers of something like 75 on a 40-noder and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1).
>> Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing .
>> 
>> So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs. b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files. (In reality, with the default hash partitioner for key 1, it will wind up in partition 0001 unless there's one reducer, as I guess there was in your test.)
>> 
>> 2 -- it is simple. When you send n rows to reducers, they are shuffled and sorted. Sending massive sets to reducers has two effects: first, even if they all group under the same key, they are still sorted in ~ n log(n/p), where p is the number of partitions, assuming uniform distribution (which it is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually sorting in a distributed way but rather doing a simple single-threaded sort at the one reducer that happens to get all the input.
>> 
>> So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.
>> 
>> Instead, consider this: the map has state consisting of (sum(X), k). It keeps updating it (sum += x, k++) for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. So we just reduced the complexity of the sort and I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.
>> 
>> Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i indexes the tuples seen by the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation.
>> 
>> But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.
>> 
>> Finally, you don't even need to coerce to 1 reducer. You could still have several (uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case IMO.
>> 
>> 3 -- I guess any writable is OK except NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either key or value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.
>> 
>> 
>> - Dmitriy
>> 
>> 
>> -----------------------------------------------------------
>> This is an automatically generated e-mail. To reply, visit:
>> https://reviews.apache.org/r/3147/#review3838
>> -----------------------------------------------------------
>> 
>> 
>> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
>> bq.
>> bq.  -----------------------------------------------------------
>> bq.  This is an automatically generated e-mail. To reply, visit:
>> bq.  https://reviews.apache.org/r/3147/
>> bq.  -----------------------------------------------------------
>> bq.
>> bq.  (Updated 2011-12-12 00:30:24)
>> bq.
>> bq.
>> bq.  Review request for mahout.
>> bq.
>> bq.
>> bq.  Summary
>> bq.  -------
>> bq.
>> bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.
>> bq.
>> bq.
>> bq.  This addresses bug MAHOUT-923.
>> bq.      https://issues.apache.org/jira/browse/MAHOUT-923
>> bq.
>> bq.
>> bq.  Diffs
>> bq.  -----
>> bq.
>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
>> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
>> bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
>> bq.
>> bq.  Diff: https://reviews.apache.org/r/3147/diff
>> bq.
>> bq.
>> bq.  Testing
>> bq.  -------
>> bq.
>> bq.  Junit test
>> bq.
>> bq.
>> bq.  Thanks,
>> bq.
>> bq.  Raphael
>> bq.
>> bq.
>> 
>> 
>> 
>>> Row mean job for PCA
>>> --------------------
>>> 
>>>                 Key: MAHOUT-923
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Math
>>>    Affects Versions: 0.6
>>>            Reporter: Raphael Cendrillon
>>>            Assignee: Raphael Cendrillon
>>>             Fix For: Backlog
>>> 
>>>         Attachments: MAHOUT-923.patch
>>> 
>>> 
>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 


Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Oh now i remember what the deal with NullWritable was.

Yes, a sequence file would read it, as in:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path testPath = new Path("name.seq");

    IntWritable iw = new IntWritable();
    // Writing a NullWritable key works fine...
    SequenceFile.Writer w =
      SequenceFile.createWriter(fs,
                                conf,
                                testPath,
                                NullWritable.class,
                                IntWritable.class);
    w.append(NullWritable.get(), iw);
    w.close();

    // ...and so does reading it back with an explicit reader.
    SequenceFile.Reader r = new SequenceFile.Reader(fs, testPath, conf);
    while (r.next(NullWritable.get(), iw));
    r.close();


but SequenceFileInputFormat would not. I.e., it is OK if you read it
explicitly, but I don't think one can use such files as input for
other MR jobs.

But since there's no MR job to consume that output (and there likely
never will be), I guess it is OK to save NullWritable in this
case...
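For the record, the reflection pitfall is easy to reproduce without Hadoop at all. The sketch below is illustrative only: it uses a hypothetical NullLike singleton as a stand-in for NullWritable, and a naive factory that, like the old reflection-based Writable factories, insists on a no-arg constructor it is allowed to call:

```java
public class NullKeyPitfall {
  // Toy stand-in for NullWritable: a singleton whose constructor is private.
  // (Hypothetical class for illustration -- not Hadoop's actual NullWritable.)
  static final class NullLike {
    static final NullLike INSTANCE = new NullLike();
    private NullLike() {}
  }

  // A naive deserialization factory that requires a public no-arg constructor.
  public static String tryInstantiate() {
    try {
      NullLike n = NullLike.class.getConstructor().newInstance();
      return "instantiated";
    } catch (ReflectiveOperationException e) {
      // getConstructor() only sees public constructors, so the private
      // singleton constructor surfaces as NoSuchMethodException.
      return e.getClass().getSimpleName();
    }
  }

  public static void main(String[] args) {
    System.out.println(tryInstantiate());  // NoSuchMethodException
  }
}
```

This is the failure mode described above: any factory that needs to conjure an instance reflectively has nothing to call on a private-constructor singleton.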

-d

On Mon, Dec 12, 2011 at 12:30 AM, jiraposter@reviews.apache.org
(Commented) (JIRA) <ji...@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ]
>
> jiraposter@reviews.apache.org commented on MAHOUT-923:
> ------------------------------------------------------
>
>
>
> bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
> bq.  > Hm. I hope I did not misread the code or miss something.
> bq.  >
> bq.  > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.
> bq.  > 2 -- mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then the reduction becomes largely symbolic (but you do need to pass on the # of rows the mapper has seen to the reducer, in order for that operation to apply correctly).
> bq.  > 3 -- I am not sure -- is NullWritable as a key legit? In my experience the sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making the constructor private.
> bq.
> bq.  Raphael Cendrillon wrote:
> bq.      Thanks Dmitriy.
> bq.
> bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
> bq.
> bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
> bq.
> bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
> bq.
> bq.
>
> NullWritable objection is withdrawn. Apparently I haven't looked at Hadoop in too long; amazingly, it seems to work now.
>
>
> 1 -- I don't think your statement about the # of reduce tasks is true.
>
> The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the Hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to have a number of reducers of something like 75 on a 40-noder, and consequently 75 output files (unless users really want to read the details of your job and figure out that you meant it to be just 1).
> Now, it is true that only one file will actually end up having something and the rest of the task slots will just be occupied doing nothing.
>
> So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs. b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files. (In reality, with the default hash partitioner for key 1, it will wind up in partition 0001 unless there's one reducer, as I guess there was in your test.)
>
> 2 -- it is simple. When you send n rows to reducers, they are shuffled and sorted. Sending massive sets to reducers has two effects: first, even if they all group under the same key, they are still sorted in ~ n log(n/p), where p is the number of partitions, assuming uniform distribution (which it is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually sorting in a distributed way but rather doing a simple single-threaded sort at the one reducer that happens to get all the input.
>
> So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all.
>
> Instead, consider this: the map has state consisting of (sum(X), k). It keeps updating it (sum += x, k++) for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. So we just reduced the complexity of the sort and I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least about 4 orders of magnitude.
>
> Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i indexes the tuples seen by the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation.
>
> But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.
>
> Finally, you don't even need to coerce to 1 reducer. You could still have several (uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case IMO.
>
> 3 -- I guess any writable is OK except NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either key or value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.
>
>
> - Dmitriy
>
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/3147/#review3838
> -----------------------------------------------------------
>
>
> On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
> bq.
> bq.  -----------------------------------------------------------
> bq.  This is an automatically generated e-mail. To reply, visit:
> bq.  https://reviews.apache.org/r/3147/
> bq.  -----------------------------------------------------------
> bq.
> bq.  (Updated 2011-12-12 00:30:24)
> bq.
> bq.
> bq.  Review request for mahout.
> bq.
> bq.
> bq.  Summary
> bq.  -------
> bq.
> bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.
> bq.
> bq.
> bq.  This addresses bug MAHOUT-923.
> bq.      https://issues.apache.org/jira/browse/MAHOUT-923
> bq.
> bq.
> bq.  Diffs
> bq.  -----
> bq.
> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095
> bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION
> bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095
> bq.
> bq.  Diff: https://reviews.apache.org/r/3147/diff
> bq.
> bq.
> bq.  Testing
> bq.  -------
> bq.
> bq.  Junit test
> bq.
> bq.
> bq.  Thanks,
> bq.
> bq.  Raphael
> bq.
> bq.
>
>
>
>> Row mean job for PCA
>> --------------------
>>
>>                 Key: MAHOUT-923
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Math
>>    Affects Versions: 0.6
>>            Reporter: Raphael Cendrillon
>>            Assignee: Raphael Cendrillon
>>             Fix For: Backlog
>>
>>         Attachments: MAHOUT-923.patch
>>
>>
>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171947#comment-13171947 ] 

Dmitriy Lyubimov commented on MAHOUT-923:
-----------------------------------------

Sorry, I was/am busy with various performance tweaks, so I probably will get back to PCA no sooner than New Year's Eve. (At which point I would be a really sore loser, to spend a family holiday on patching Mahout, wouldn't I?) :)
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168133#comment-13168133 ] 

Raphael Cendrillon commented on MAHOUT-923:
-------------------------------------------

Thanks Lance. I've updated this on reviewboard. Could you please take a look?
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168414#comment-13168414 ] 

Sean Owen commented on MAHOUT-923:
----------------------------------

clone() can return what it likes, though it is intended to return an object of the same class. So it *could* do what you describe "legally", even if it generally doesn't or perhaps shouldn't. What's the flaw that Lance was alluding to? I am aware of sins in implementations of clone(), but nothing that was actually causing a problem.
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168289#comment-13168289 ] 

Sean Owen commented on MAHOUT-923:
----------------------------------

Lance, what are the clone() "flaws" you're talking about? I'm not aware of anything.
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167292#comment-13167292 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

Review request for mahout.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168352#comment-13168352 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3874
-----------------------------------------------------------



/trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java
<https://reviews.apache.org/r/3147/#comment8701>

    I would really rather use standard terminology here.
    
    A mean row is a row that is the average of all others, but a row mean would be the average of the elements of a single row.  The plural form, row means, indicates the means of all rows.  What you are computing are the means of every column.
    
    In contrast, R, Octave and Matlab all use columnMeans as the name of the function being implemented here.



/trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
<https://reviews.apache.org/r/3147/#comment8702>

    There are lots of lines with trailing white space.  Isn't this easily suppressed?


- Ted


On 2011-12-13 04:46:47, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-13 04:46:47)
bq.  
bq.  
bq.  Review request for mahout, lancenorskog and Dmitriy Lyubimov.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167312#comment-13167312 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


Hm. I hope I did not misread the code or miss something.

1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code.
2 -- mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then the reduction becomes largely symbolic (but you do need to pass on the # of rows the mapper has seen to the reducer, in order for that operation to apply correctly).
3 -- I am not sure -- is NullWritable as a key legit? In my experience the sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making the constructor private.
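A plain-Java sketch of the map-side accumulation in point 2 (no Hadoop types; the method and class names here are illustrative, not the actual patch): each mapper keeps a running column-wise sum plus a row count, emits that single partial tuple in cleanup, and the lone reducer merges the handful of partials and divides once.

```java
import java.util.Arrays;

public class ColumnMeans {
  // What each mapper would emit in cleanup(): the column-wise sum of its
  // rows, with the row count stored in the last slot.
  public static double[] partial(double[][] rows, int cols) {
    double[] out = new double[cols + 1];
    for (double[] row : rows) {
      for (int j = 0; j < cols; j++) out[j] += row[j];
      out[cols]++;  // count of rows this mapper saw
    }
    return out;
  }

  // What the single reducer (or the front end) would do with the handful of
  // partial tuples: sum them, then divide once by the total row count.
  public static double[] combine(double[][] partials, int cols) {
    double[] sum = new double[cols];
    double n = 0;
    for (double[] p : partials) {
      for (int j = 0; j < cols; j++) sum[j] += p[j];
      n += p[cols];
    }
    for (int j = 0; j < cols; j++) sum[j] /= n;
    return sum;
  }

  public static void main(String[] args) {
    double[][] split1 = {{1, 2}, {3, 4}};  // rows seen by mapper 1
    double[][] split2 = {{5, 6}};          // rows seen by mapper 2
    double[] means = combine(
        new double[][]{partial(split1, 2), partial(split2, 2)}, 2);
    System.out.println(Arrays.toString(means));  // [3.0, 4.0]
  }
}
```

The sort/shuffle then moves only one small vector per mapper instead of every row of the matrix, which is the whole point of the suggestion.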

- Dmitriy


On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-12 00:30:24)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows, and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167337#comment-13167337 ] 

Lance Norskog commented on MAHOUT-923:
--------------------------------------

MatrixRowMeanJob writes <NullWritable,VectorWritable>, but the convention for Mahout jobs is <IntWritable,VectorWritable>. Should it use an IntWritable instead, for consistency? The int key would be 1.

Can MatrixRowMeanJob have a Combiner? Would this help with performance in the standard use cases? The key would still be NullWritable; the value would be a VectorWritable and a counter. The Reducer would include the counters for each input Vector.

Should MatrixRowMeanJob have a configurable class for the output vector?
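On the combiner question: the value a combiner can merge safely is a (sum, count) pair, not a mean, because averaging per-split means gives the wrong answer whenever splits differ in size. A small illustration (plain Java, hypothetical method names, not the patch itself):

```java
public class CombinerMath {
  // Safe combine: merge (sum, count) pairs, divide once at the very end.
  public static double exactMean(double[][] splits) {
    double sum = 0, count = 0;
    for (double[] s : splits) {
      for (double x : s) sum += x;
      count += s.length;
    }
    return sum / count;
  }

  // The tempting but wrong combine: average the per-split means. With
  // unequal split sizes this drifts away from the true mean.
  public static double meanOfMeans(double[][] splits) {
    double acc = 0;
    for (double[] s : splits) {
      double m = 0;
      for (double x : s) m += x;
      acc += m / s.length;
    }
    return acc / splits.length;
  }

  public static void main(String[] args) {
    double[][] splits = {{1, 3}, {5}};        // split sizes 2 and 1
    System.out.println(exactMean(splits));    // 3.0
    System.out.println(meanOfMeans(splits));  // 3.5 -- wrong
  }
}
```

That is exactly why the proposed IntVectorTupleWritable carries the row count alongside the partial sum.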
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Updated] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-923:
--------------------------------------

    Attachment: MAHOUT-923.patch
    
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167338#comment-13167338 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq.  > Hm. I hope I did not misread the code or miss something. 
bq.  > 
bq.  > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code. 
bq.  > 2 -- Mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary, even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for that operation to apply correctly).
bq.  > 3 -- I am not sure -- is NullWritable legit as a key? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by making the constructor private.

Thanks Dmitriy.

Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?

Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?

Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?


- Raphael


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-12 00:30:24)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168111#comment-13168111 ] 

Lance Norskog commented on MAHOUT-923:
--------------------------------------

The right way to set the vector class is to use Vector.like() instead of clone(), and then copy the old vector into the new one. clone() is basically bogus; it was a nice idea that has turned out to have flaws. Vector.like() allows any kind of Vector to create a different kind, as appropriate. The class parameter would override that, but would not be necessary.
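To make the pattern concrete, here is a minimal stand-in sketch; the Vec/DenseVec/VecUtil types below are illustrative only, not Mahout's actual org.apache.mahout.math.Vector API -- only the like()-then-copy shape is the point:

```java
// Minimal stand-in for the like()-then-copy pattern described above.
// These types are hypothetical; Mahout's real Vector interface is richer.
interface Vec {
  int size();
  double get(int i);
  void set(int i, double v);
  Vec like();  // an empty vector of the same (or an appropriate) concrete type
}

class DenseVec implements Vec {
  private final double[] values;
  DenseVec(int size) { values = new double[size]; }
  public int size() { return values.length; }
  public double get(int i) { return values[i]; }
  public void set(int i, double v) { values[i] = v; }
  public Vec like() { return new DenseVec(values.length); }
}

class VecUtil {
  // Instead of clone(), ask the source vector for a compatible empty
  // vector via like() and copy element-wise into it.
  static Vec copyOf(Vec src) {
    Vec dst = src.like();
    for (int i = 0; i < src.size(); i++) {
      dst.set(i, src.get(i));
    }
    return dst;
  }
}
```

The benefit is that each Vector implementation decides for itself what a suitable target type is, rather than clone() pinning the copy to the source's exact class.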
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Updated] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-923:
--------------------------------------

    Attachment: MAHOUT-923.patch
    
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171673#comment-13171673 ] 

Raphael Cendrillon commented on MAHOUT-923:
-------------------------------------------

Are there any more changes needed at this point?  What's the next step from here?
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168492#comment-13168492 ] 

Raphael Cendrillon commented on MAHOUT-923:
-------------------------------------------

Eclipse. I've added the provided format template, but sometimes whitespace still pops up. I'll look into this some more. It's driving me crazy :)



                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167965#comment-13167965 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-13 00:09:03.441301)


Review request for mahout.


Changes
-------

Added checks to ensure reliable operation with null matrices


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168543#comment-13168543 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-13 17:53:35.691333)


Review request for mahout, lancenorskog and Dmitriy Lyubimov.


Changes
-------

Changed method name to columnMeans(). Removed trailing whitespaces.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167293#comment-13167293 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-12 00:30:24.091994)


Review request for mahout.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167291#comment-13167291 ] 

Raphael Cendrillon commented on MAHOUT-923:
-------------------------------------------

It might be worthwhile to pull this into a separate issue. Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.
Does a class like this already exist?
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167443#comment-13167443 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq.  > Hm. I hope I did not misread the code or miss something. 
bq.  > 
bq.  > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code. 
bq.  > 2 -- Mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary, even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for that operation to apply correctly).
bq.  > 3 -- I am not sure -- is NullWritable legit as a key? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by making the constructor private.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitriy.
bq.      
bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
bq.      
bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
bq.      
bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
bq.      
bq.
bq.  
bq.  Dmitriy Lyubimov wrote:
bq.      NullWritable objection is withdrawn. Apparently I haven't looked into Hadoop in too long; amazingly, it seems to work now.
bq.      
bq.      
bq.      1 -- I don't think your statement about the # of reduce tasks is true. 
bq.      
bq.      The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the Hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to have something like 75 reducers on a 40-node cluster, and consequently 75 output files (unless users really read the details of your job and figure out you meant it to be just 1). 
bq.      Now, it is true that only one file will actually end up having something, and the rest of the task slots will just be occupied doing nothing. 
bq.      
bq.      So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs; b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of the 75 files (in reality, with the default hash partitioner and key 1, it will wind up in partition 0001 unless there's only one reducer, as I guess was the case in your test). 
bq.      
bq.      2 -- It is simple. When you send n rows to reducers, they are shuffled and sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted at a cost of ~ n log (n/p), where p is the number of partitions, assuming uniform distribution (which it is not, because you are sending everything to the same place). Just because we can run a distributed sort doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally, what makes your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing the sort in a distributed way but rather a simple single-threaded sort at the one reducer that happens to get all the input. 
bq.      
bq.      So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all. 
bq.      
bq.      Instead, consider this: the map has a state consisting of (sum(X), k). It keeps updating it (sum += x, k++) for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. So we just reduced the complexity of the sort and the I/O from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, at least 4 orders of magnitude. 
bq.      
bq.      Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i indexes the tuples in the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors to sum up doesn't warrant distributed computation. 
bq.      
bq.      But you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.
bq.      
bq.      Finally, you don't even need to coerce to 1 reducer. You can still have several (uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is OK in this case IMO.
bq.      
bq.      3 -- I guess any writable is OK except NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either key or value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with keys in shuffle and sort: there is a reflection writable factory inside which tries to use the class's default constructor to instantiate it, which is (was) not available for NullWritable.
bq.      
bq.      
bq.

Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?


- Raphael


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


On 2011-12-12 10:41:46, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-12 10:41:46)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167406#comment-13167406 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq.  > Hm. I hope I did not misread the code or miss something. 
bq.  > 
bq.  > 1 -- I am not sure this will actually work as intended unless the # of reducers is coerced to 1, of which I see no mention in the code. 
bq.  > 2 -- Mappers do nothing, passing all the row pressure on to the sort, which is absolutely not necessary, even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then the reduction becomes largely symbolic (but you do need to pass the # of rows each mapper has seen on to the reducer, in order for that operation to apply correctly).
bq.  > 3 -- I am not sure -- is NullWritable legit as a key? In my experience the sequence file reader cannot instantiate it, because NullWritable is a singleton and its creation is prohibited by making the constructor private.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitriy.
bq.      
bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
bq.      
bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
bq.      
bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However, if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
bq.      
bq.

NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.


1 -- I don't think your statement about # of reduce tasks is true. 

The job (or, rather, user) sets the number of reduce tasks via config propery. All users will follow hadoop recommendation to set that to 95% of capacity they want to take. (usually the whole cluster). So in production environment you are virtually _guaranteed_ to have number of reducers of something like 75 on a 40-noder and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1). 
Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing . 

So there are two problems with that scheme: a) is that job that allocates so many task slots that do nothing is not a good citizen, since in real production cluster is always shared with multiple jobs. b) your code assumes result will end up in partition 0, whereas contractually it may end up in any of 75 files. (in reality with default hash partitioner for key 1 it will wind up in partion 0001 unless there's one reducer as i guess in your test was). 

2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input. 

So that would allocate a lot of task slots that are not used; do a sort that is not needed; and do it in a single reducer thread for the entire input, which is not parallel at all. 

Instead, consider this: the map has a state consisting of (sum(X), k). it keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. so we just reduced the complexity of the sort and the io from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, by at least 4 orders of magnitude. 

Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i is the tuple in the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation. 

But, you have to make sure there's only 1 reducer no matter what the user put into the config, and you have to make sure you do all the heavy lifting in the mappers.

Finally, you don't even need to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.
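
The (sum(X), k) scheme described above can be sketched stand-alone, outside of hadoop. This is only an illustrative simulation with made-up names (the real job would operate on VectorWritable rows inside Mapper/Reducer classes, not on double[] in plain methods): each "mapper" folds its split into a single (sum, count) tuple, and the lone "reducer" merges that handful of tuples and divides once.

```java
import java.util.Arrays;
import java.util.List;

public class ColumnMeansSketch {

  // Per-mapper partial result: column-wise sum plus the number of rows seen.
  public static final class SumCount {
    final double[] sum;
    final long count;
    SumCount(double[] sum, long count) { this.sum = sum; this.count = count; }
  }

  // "Mapper": fold all rows of one input split into a single (sum, k) tuple,
  // so only one record per mapper ever reaches the shuffle-and-sort.
  public static SumCount mapSplit(List<double[]> rows, int cols) {
    double[] sum = new double[cols];
    long k = 0;
    for (double[] row : rows) {
      for (int j = 0; j < cols; j++) {
        sum[j] += row[j];
      }
      k++;
    }
    return new SumCount(sum, k);
  }

  // "Reducer": combine the handful of per-mapper tuples and divide once.
  public static double[] reduce(List<SumCount> partials, int cols) {
    double[] total = new double[cols];
    long n = 0;
    for (SumCount p : partials) {
      for (int j = 0; j < cols; j++) {
        total[j] += p.sum[j];
      }
      n += p.count;
    }
    for (int j = 0; j < cols; j++) {
      total[j] /= n;
    }
    return total;
  }

  public static void main(String[] args) {
    List<double[]> split1 = Arrays.asList(new double[]{1, 2}, new double[]{3, 4});
    List<double[]> split2 = Arrays.asList(new double[]{5, 6});
    double[] mean = reduce(
        Arrays.asList(mapSplit(split1, 2), mapSplit(split2, 2)), 2);
    System.out.println(Arrays.toString(mean)); // prints [3.0, 4.0]
  }
}
```

Note the division happens exactly once, in the reducer, after all counts are summed; dividing per-mapper and averaging the averages would be wrong for unevenly sized splits.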

3 -- i guess any writable is ok but NullWritable. Maybe something has changed. i remember falling into that pitfall several generations of hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. in my test long ago it would write ok but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.


- Dmitriy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


On 2011-12-12 00:30:24, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-12 00:30:24)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171667#comment-13171667 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-17 20:38:27.774380)


Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.


Changes
-------

Added javadocs


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171671#comment-13171671 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-17 20:50:42.776447)


Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.


Changes
-------

Correct version.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Updated] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-923:
--------------------------------------

    Attachment: MAHOUT-923.patch
    
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168009#comment-13168009 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-13 00:58:36.591798)


Review request for mahout and Dmitriy Lyubimov.


Changes
-------

Added private static final to 'one', removed clone(), added option to specify class of return vector.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167798#comment-13167798 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq.  > Hm. I hope i did not misread the code or miss something. 
bq.  > 
bq.  > 1 -- i am not sure this will actually work as intended unless # of reducers is coerced to 1, of which i see no mention in the code. 
bq.  > 2 -- mappers do nothing, passing on all the row pressure to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then reduction becomes largely symbolic (but you do need to pass on the # of rows the mapper has seen to the reducer, in order for that operation to apply correctly).
bq.  > 3 -- i am not sure -- is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitry.
bq.      
bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
bq.      
bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
bq.      
bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
bq.      
bq.
bq.  
bq.  Dmitriy Lyubimov wrote:
bq.      NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.
bq.      
bq.      
bq.      1 -- I don't think your statement about # of reduce tasks is true. 
bq.      
bq.      The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to have a number of reducers of something like 75 on a 40-node cluster and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1). 
bq.      Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing . 
bq.      
bq.      So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs. b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of 75 files. (in reality, with the default hash partitioner for key 1 it will wind up in partition 0001 unless there's one reducer, as i guess there was in your test). 
bq.      
bq.      2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input. 
bq.      
bq.      So that would allocate a lot of tasks slots that are not used; but do a sort that is not needed; and do it in a single reducer thread for the entire input which is not parallel at all. 
bq.      
bq.      Instead, consider this: the map has a state consisting of (sum(X), k). it keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. so we just reduced the complexity of the sort and the io from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, by at least 4 orders of magnitude. 
bq.      
bq.      Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i is the tuple in the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation. 
bq.      
bq.      But, you have to make sure there's only 1 reducer no matter what user put into the config, and you have to make sure you do all heavy lifting in the mappers.
bq.      
bq.      Finally, you don't even need to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.
bq.      
bq.      3 -- i guess any writable is ok but NullWritable. Maybe something has changed. i remember falling into that pitfall several generations of hadoop ago. You can verify by staging a simple experiment of writing a sequence file with NullWritable as either key or value and trying to read it back. in my test long ago it would write ok but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to bring it up, which is (was) not available for NullWritable.
bq.      
bq.      
bq.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?

looks good on top of it. 
One nitpick that i have is this 

context.write(NullWritable.get(), new VectorWritable(runningSum));

but runningSum is initialized in the map loop which technically may never be called (not likely but theoretically possible).

therefore i'd initialize the runningSum vector to something that is non-null in setup. or better yet just check for null and skip the map output if there's no data. 

Same considerations for reducer. it needs to handle the corner case when there's no input, correctly.
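
The null guard suggested above can be illustrated stand-alone (names here are made up; this is not the actual patch code): runningSum stays null until map() sees its first row, and cleanup() simply writes nothing for an empty split instead of emitting a bogus zero vector.

```java
import java.util.Optional;

public class GuardedMeanMapper {

  private double[] runningSum;  // stays null until the first row arrives
  private long rowCount;

  // Analogue of Mapper.map(): fold one row into the running column-wise sum.
  public void map(double[] row) {
    if (runningSum == null) {
      runningSum = row.clone();  // first row starts the sum
    } else {
      for (int j = 0; j < row.length; j++) {
        runningSum[j] += row[j];
      }
    }
    rowCount++;
  }

  // Analogue of Mapper.cleanup(): emit the partial sum only if map() ever ran,
  // so an empty split contributes nothing to the reducer.
  public Optional<double[]> cleanup() {
    return Optional.ofNullable(runningSum);
  }

  public long rowCount() { return rowCount; }
}
```

The reducer side needs the same kind of guard: if no mapper emitted anything, it should produce no output rather than divide by a zero count.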


- Dmitriy


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


On 2011-12-12 10:41:46, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-12 10:41:46)
bq.  
bq.  
bq.  Review request for mahout.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167980#comment-13167980 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3866
-----------------------------------------------------------



/trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
<https://reviews.apache.org/r/3147/#comment8690>

    can be private static final



/trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java
<https://reviews.apache.org/r/3147/#comment8691>

    No need to clone here, see org.apache.mahout.common.mapreduce.VectorSumReducer


- Sebastian


On 2011-12-13 00:10:57, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-13 00:10:57)
bq.  
bq.  
bq.  Review request for mahout and Dmitriy Lyubimov.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167972#comment-13167972 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-12 02:10:01, Dmitriy Lyubimov wrote:
bq.  > Hm. I hope i did not misread the code or miss something. 
bq.  > 
bq.  > 1 -- i am not sure this will actually work as intended unless # of reducers is coerced to 1, of which i see no mention in the code. 
bq.  > 2 -- mappers do nothing, passing on all the row pressure to the sort, which is absolutely not necessary. Even if you use combiners. This is going to be especially the case if you coerce 1 reducer and no combiners. IMO the mean computation should be pushed up to the mappers to avoid the sort pressures of map reduce. Then reduction becomes largely symbolic (but you do need to pass on the # of rows the mapper has seen to the reducer, in order for that operation to apply correctly).
bq.  > 3 -- i am not sure -- is NullWritable as a key legit? In my experience sequence file reader cannot instantiate it because NullWritable is a singleton and its creation is prohibited by making constructor private.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitry.
bq.      
bq.      Regarding 1, if I understand correctly the number of reducers depends on the number of unique keys. Since all keys are set to the same value (null), then all of the mapper outputs should arrive at the same reducer. This seems to work in the unit test, but I may be missing something?
bq.      
bq.      Regarding 2, that makes a lot of sense. I'm wondering how many rows should be processed per mapper?  I guess there is a trade-off between scalability (processing more rows within a single map job means that each row must have fewer columns) and speed?  Is there someplace in the SSVD code where the matrix is split into slices of rows that I could use as a reference?
bq.      
bq.      Regarding 3, I believe NullWritable is OK. It's used pretty extensively in TimesSquaredJob in DistributedRowMatrix. However if you feel there is some disadvantage to this I could replace "NullWritable.get()" with "new IntWritable(1)" (that is, set all of the keys to 1). Would that be more suitable?
bq.      
bq.
bq.  
bq.  Dmitriy Lyubimov wrote:
bq.      NullWritable objection is withdrawn. Apparently i haven't looked into hadoop for too long, amazingly it seems to work now.
bq.      
bq.      
bq.      1 -- I don't think your statement about # of reduce tasks is true. 
bq.      
bq.      The job (or, rather, the user) sets the number of reduce tasks via a config property. All users will follow the hadoop recommendation to set that to 95% of the capacity they want to take (usually the whole cluster). So in a production environment you are virtually _guaranteed_ to have a number of reducers of something like 75 on a 40-node cluster and consequently 75 output files (unless users really want to read the details of your job and figure you meant it to be just 1). 
bq.      Now, it is true that only one file will actually end up having something and the rest of task slots will just be occupied doing nothing . 
bq.      
bq.      So there are two problems with that scheme: a) a job that allocates so many task slots that do nothing is not a good citizen, since a real production cluster is always shared among multiple jobs. b) your code assumes the result will end up in partition 0, whereas contractually it may end up in any of 75 files. (in reality, with the default hash partitioner for key 1 it will wind up in partition 0001 unless there's one reducer, as i guess there was in your test). 
bq.      
bq.      2-- it is simple. when you send n rows to reducers, they are shuffled - and - sorted. Sending massive sets to reducers has 2 effects: first, even if they all group under the same key, they are still sorted with ~ n log (n/p) where p is number of partitions assuming uniform distribution (which it is not because you are sending everything to the same place). Just because we can run distributed sort, doesn't mean we should. Secondly, all these rows are physically moved to reduce tasks, which is still ~n rows. Finally what has made your case especially problematic is that you are sending everything to the same reducer, i.e. you are not actually doing sort in distributed way but rather simple single threaded sort at the reducer that happens to get all the input. 
bq.      
bq.      So that would allocate a lot of tasks slots that are not used; but do a sort that is not needed; and do it in a single reducer thread for the entire input which is not parallel at all. 
bq.      
bq.      Instead, consider this: the map has a state consisting of (sum(X), k). it keeps updating it, sum += x, k++, for every new x. At the end of the cycle (in cleanup) it writes only 1 tuple, (sum(X), k), as output. so we just reduced the complexity of the sort and the io from millions of elements to just the # of maps (which is perhaps just a handful and in reality rarely overshoots 500 mappers). That is, by at least 4 orders of magnitude. 
bq.      
bq.      Now, we send that handful of tuples to a single reducer and just do the combining (sum(X) += sum_i(X); n += n_i), where i is the tuple in the reducer. And because it is only a handful, the reducer also runs very quickly, so the fact that we coerced it to be 1 is pretty benign. That volume of anywhere between 1 and 500 vectors it sums up doesn't warrant distributed computation. 
bq.      
bq.      But, you have to make sure there's only 1 reducer no matter what user put into the config, and you have to make sure you do all heavy lifting in the mappers.
bq.      
bq.      Finally, you don't even need to coerce to 1 reducer. You could still have several (but uniformly distributed) and do the final combine in the front end of the method. However, given the small size and triviality of the reduction processing, it is probably not warranted. Coercing to 1 reducer is ok in this case IMO.
bq.      
bq.      3 -- I guess any writable is OK but NullWritable. Maybe something has changed; I remember falling into that pitfall several generations of Hadoop ago. You can verify by staging a simple experiment: write a sequence file with NullWritable as either key or value and try to read it back. In my test long ago it would write OK but not read back. I believe a similar approach is used with keys in shuffle and sort. There is a reflection writable factory inside which tries to use the default constructor of the class to instantiate it, which is (was) not available for NullWritable.
bq.      
bq.      
bq.
bq.  
bq.  Raphael Cendrillon wrote:
bq.      Thanks Dmitriy. I've updated the diff to push the row summation into the mapper as you suggested, force the number of reducers to 1, and make the final output key IntWritable. Could you please take a look?
bq.  
bq.  Dmitriy Lyubimov wrote:
bq.      looks good on top of it. 
bq.      One nitpicking that i have is this 
bq.      
bq.      context.write(NullWritable.get(), new VectorWritable(runningSum));
bq.      
bq.      but runningSum is initialized in the map loop which technically may never be called (not likely but theoretically possible).
bq.      
bq.      So I'd initialize the runningSum vector to something non-null in setup, or better yet just check for null and skip the map output if there's no data. 
bq.      
bq.      Same considerations for reducer. it needs to handle the corner case when there's no input, correctly.
bq.

Thanks Dmitry. I've updated the patch to add these checks.
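The accumulate-in-the-mapper scheme described above boils down to arithmetic like the following sketch. This is plain Java with made-up names (ColumnMeansSketch is not a class from the patch), and the Hadoop plumbing is omitted: partialSum() stands in for one map task's map/cleanup cycle, and combine() stands in for the single reducer, including the empty-input corner cases discussed above.

```java
public class ColumnMeansSketch {

    // Simulates one map task: accumulate a running column-wise sum over its
    // slice of rows and emit a single partial result in cleanup, instead of
    // shuffling every row to the reducer.
    static double[] partialSum(double[][] rows) {
        if (rows.length == 0) {
            return null; // a mapper that saw no input emits nothing
        }
        double[] sum = new double[rows[0].length];
        for (double[] row : rows) {
            for (int j = 0; j < row.length; j++) {
                sum[j] += row[j];
            }
        }
        return sum;
    }

    // Simulates the single reducer: combine the handful of per-mapper partial
    // sums and divide by the total row count to get the column-wise mean.
    static double[] combine(double[][] partialSums, int totalRows) {
        double[] total = null;
        for (double[] p : partialSums) {
            if (p == null) {
                continue; // skip outputs of mappers that saw no input
            }
            if (total == null) {
                total = p.clone();
            } else {
                for (int j = 0; j < p.length; j++) {
                    total[j] += p[j];
                }
            }
        }
        if (total == null) {
            return null; // no input at all
        }
        for (int j = 0; j < total.length; j++) {
            total[j] /= totalRows;
        }
        return total;
    }
}
```

The point of the pattern is that the shuffle now carries one vector per map task rather than one record per matrix row.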


- Raphael


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3838
-----------------------------------------------------------


On 2011-12-13 00:10:57, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-13 00:10:57)
bq.  
bq.  
bq.  Review request for mahout and Dmitriy Lyubimov.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13169868#comment-13169868 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3916
-----------------------------------------------------------


Looks much better from the trivial formatting standpoint.


- Ted


On 2011-12-13 17:53:35, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-13 17:53:35)
bq.  
bq.  
bq.  Review request for mahout, lancenorskog and Dmitriy Lyubimov.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Updated] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-923:
--------------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

Marking this as closed as patch has been integrated into MAHOUT-817
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168465#comment-13168465 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------



bq.  On 2011-12-13 13:08:20, Ted Dunning wrote:
bq.  > /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java, line 199
bq.  > <https://reviews.apache.org/r/3147/diff/5/?file=64279#file64279line199>
bq.  >
bq.  >     I would really rather use standard terminology here.
bq.  >     
bq.  >     A mean row is a row that is the average of all other rows, but a row mean would mean the average of the elements of a single row.  The plural form, row means, indicates the means of all rows.  What you are computing are the means of every column.
bq.  >     
bq.  >     In contrast, R, Octave and Matlab all use columnMeans as the name of the function being implemented here.

Sure. In Matlab/Octave I'm used to mean(A,1) (it takes the mean across the 1st dimension, i.e. across rows, but done per column). I'll change this to colMeans(), which seems clearer.
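To make the terminology concrete, here is an illustrative snippet (not code from the patch; MeansTerminology and its methods are made-up names): for a 2x3 matrix, column means (Matlab's mean(A,1)) give one value per column, while row means (mean(A,2)) give one value per row.

```java
public class MeansTerminology {

    // mean(A,1): averages down each column, one value per column.
    static double[] columnMeans(double[][] a) {
        double[] m = new double[a[0].length];
        for (double[] row : a) {
            for (int j = 0; j < row.length; j++) {
                m[j] += row[j] / a.length;
            }
        }
        return m;
    }

    // mean(A,2): averages across each row, one value per row.
    static double[] rowMeans(double[][] a) {
        double[] m = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (double v : a[i]) {
                m[i] += v;
            }
            m[i] /= a[i].length;
        }
        return m;
    }
}
```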


bq.  On 2011-12-13 13:08:20, Ted Dunning wrote:
bq.  > /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java, lines 129-132
bq.  > <https://reviews.apache.org/r/3147/diff/5/?file=64280#file64280line129>
bq.  >
bq.  >     There are lots of lines with trailing white space.  Isn't this easily suppressed?

I can use sed, or perhaps there's a better way?
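For instance, trailing whitespace can be stripped with a sed one-liner like the following (a sketch; most editors and IDE save-actions can also do this automatically):

```shell
# strip_trailing_ws: remove trailing spaces and tabs from each line, in place.
# Assumes GNU sed (-i with no suffix argument); BSD/macOS sed needs: sed -i '' ...
strip_trailing_ws() {
  sed -i 's/[[:space:]]*$//' "$@"
}
```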


- Raphael


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/#review3874
-----------------------------------------------------------


On 2011-12-13 04:46:47, Raphael Cendrillon wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/3147/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-12-13 04:46:47)
bq.  
bq.  
bq.  Review request for mahout, lancenorskog and Dmitriy Lyubimov.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.
bq.  
bq.  
bq.  This addresses bug MAHOUT-923.
bq.      https://issues.apache.org/jira/browse/MAHOUT-923
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
bq.    /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
bq.    /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
bq.  
bq.  Diff: https://reviews.apache.org/r/3147/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  Junit test
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Raphael
bq.  
bq.


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171665#comment-13171665 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-17 20:32:34.158332)


Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs
-----

  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171669#comment-13171669 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-17 20:46:39.068863)


Review request for mahout, Ted Dunning, lancenorskog, and Dmitriy Lyubimov.


Changes
-------

Added javadocs


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1215567 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixColumnMeansJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1215567 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.


        

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PPPS
you can also clone GitHub's own mirror of git.apache.org, but be
careful: it seems to get pretty badly out of date from time to time,
so better either use my branch or clone from Apache directly (longer)
if GitHub's mirror is out of date.

-d

On Sun, Dec 18, 2011 at 2:28 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> ps you can save time for pull/push tremendously by just cloning
> github.com:dlyubimov/mahout-commits repo. its trunk is already
> up-to-date with apache's.
>
> -d
>
> On Sun, Dec 18, 2011 at 2:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> PS if it is not terribly difficult, if you could post your patch on
>> github, it would be awesome (with complete mahout history based on
>> git.apache.org/mahout)
>>
>> Then we can merge it more easily in case it gets out of sync with the
>> trunk HEAD.
>>
>> Thank you for doing this.
>>
>>
>> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>> If i had to guess, the mapper reported time should be under 1 minute
>>> regardless of the input size on any __non-vm__ machine (unless it is
>>> IBM XT :) even with -Xmx200m which is hadoop default.
>>>
>>> The reducer depends on the input size, but unless you manage to
>>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>>>
>>> Thanks.
>>> -Dmitriy
>>>
>>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>>> <ce...@gmail.com> wrote:
>>>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it dense.
>>>>
>>>> Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?
>>>>
>>>> As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?
>>>>
>>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org> wrote:
>>>>
>>>>>
>>>>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946 ]
>>>>>
>>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>>> -----------------------------------------
>>>>>
>>>>> Raphael, thank you for seeing this thru.
>>>>>
>>>>> Q:
>>>>> 1) -- why do you need a vector class for the accumulator now? The mean is kind of expected to be dense in the end, if not in the mappers then at least in the reducer for sure. And secondly, if you want to do this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file format APIs, which take classes, not strings.
>>>>>
>>>>> 2) -- I know you have a unit test, but did you test it on a simulated input, like say 2G big? If not, I will have to test it before you proceed.
>>>>>
>>>>> As a next step, I guess I need to try it out to see if it works on various kinds of inputs.
>>>>>
>>>>>> Row mean job for PCA
>>>>>> --------------------
>>>>>>
>>>>>>                Key: MAHOUT-923
>>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>>            Project: Mahout
>>>>>>         Issue Type: Improvement
>>>>>>         Components: Math
>>>>>>   Affects Versions: 0.6
>>>>>>           Reporter: Raphael Cendrillon
>>>>>>           Assignee: Raphael Cendrillon
>>>>>>            Fix For: Backlog
>>>>>>
>>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>>>
>>>>>>
>>>>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>>>>>
>>>>> --
>>>>> This message is automatically generated by JIRA.
>>>>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>>
>>>>>

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
ps you can save time for pull/push tremendously by just cloning
github.com:dlyubimov/mahout-commits repo. its trunk is already
up-to-date with apache's.

-d

On Sun, Dec 18, 2011 at 2:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> PS if it is not terribly difficult, if you could post your patch on
> github, it would be awesome (with complete mahout history based on
> git.apache.org/mahout)
>
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
>
> Thank you for doing this.
>
>
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> If i had to guess, the mapper reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is
>> IBM XT :) even with -Xmx200m which is hadoop default.
>>
>> The reducer depends on the input size, but unless you manage to
>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>>
>> Thanks.
>> -Dmitriy
>>
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <ce...@gmail.com> wrote:
>>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it dense.
>>>
>>> Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?
>>>
>>> As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?
>>>
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org> wrote:
>>>
>>>>
>>>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946 ]
>>>>
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>>
>>>> Raphael, thank you for seeing this thru.
>>>>
>>>> Q:
>>>> 1) -- why do you need a vector class for the accumulator now? The mean is kind of expected to be dense in the end, if not in the mappers then at least in the reducer for sure. And secondly, if you want to do this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file format APIs, which take classes, not strings.
>>>>
>>>> 2) -- I know you have a unit test, but did you test it on a simulated input, like say 2G big? If not, I will have to test it before you proceed.
>>>>
>>>> As a next step, I guess I need to try it out to see if it works on various kinds of inputs.
>>>>
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>>
>>>>>                Key: MAHOUT-923
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>            Project: Mahout
>>>>>         Issue Type: Improvement
>>>>>         Components: Math
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Raphael Cendrillon
>>>>>           Assignee: Raphael Cendrillon
>>>>>            Fix For: Backlog
>>>>>
>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>>
>>>>>
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>>>>
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>>
>>>>

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Raphael Cendrillon <ce...@gmail.com>.
Sure. Github is actually much easier for me. Generating patches while working on multiple jiras gets messy :)

On Dec 18, 2011, at 2:25 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> PS if it is not terribly difficult, if you could post your patch on
> github, it would be awesome (with complete mahout history based on
> git.apache.org/mahout)
> 
> Then we can merge it more easily in case it gets out of sync with the
> trunk HEAD.
> 
> Thank you for doing this.
> 
> 
> On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>> If i had to guess, the mapper reported time should be under 1 minute
>> regardless of the input size on any __non-vm__ machine (unless it is
>> IBM XT :) even with -Xmx200m which is hadoop default.
>> 
>> The reducer depends on the input size, but unless you manage to
>> generate 1000 mappers, i don't think it will jump out of 1 min either.
>> 
>> Thanks.
>> -Dmitriy
>> 
>> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
>> <ce...@gmail.com> wrote:
>>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it dense.
>>> 
>>> Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?
>>> 
>>> As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?
>>> 
>>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org> wrote:
>>> 
>>>> 
>>>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946 ]
>>>> 
>>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>>> -----------------------------------------
>>>> 
>>>> Raphael, thank you for seeing this thru.
>>>> 
>>>> Q:
>>>> 1) -- why do you need a vector class for the accumulator now? The mean is kind of expected to be dense in the end, if not in the mappers then at least in the reducer for sure. And secondly, if you want to do this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file format APIs, which take classes, not strings.
>>>> 
>>>> 2) -- I know you have a unit test, but did you test it on a simulated input, like say 2G big? If not, I will have to test it before you proceed.
>>>> 
>>>> As a next step, I guess I need to try it out to see if it works on various kinds of inputs.
>>>> 
>>>>> Row mean job for PCA
>>>>> --------------------
>>>>> 
>>>>>                Key: MAHOUT-923
>>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>>            Project: Mahout
>>>>>         Issue Type: Improvement
>>>>>         Components: Math
>>>>>   Affects Versions: 0.6
>>>>>           Reporter: Raphael Cendrillon
>>>>>           Assignee: Raphael Cendrillon
>>>>>            Fix For: Backlog
>>>>> 
>>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>> 
>>>>> 
>>>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>>>> 
>>>> --
>>>> This message is automatically generated by JIRA.
>>>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>> 
>>>> 

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
PS if it is not terribly difficult, if you could post your patch on
github, it would be awesome (with complete mahout history based on
git.apache.org/mahout)

Then we can merge it more easily in case it gets out of sync with the
trunk HEAD.

Thank you for doing this.


On Sun, Dec 18, 2011 at 2:24 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> If i had to guess, the mapper reported time should be under 1 minute
> regardless of the input size on any __non-vm__ machine (unless it is
> IBM XT :) even with -Xmx200m which is hadoop default.
>
> The reducer depends on the input size, but unless you manage to
> generate 1000 mappers, i don't think it will jump out of 1 min either.
>
> Thanks.
> -Dmitriy
>
> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
> <ce...@gmail.com> wrote:
>> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it dense.
>>
>> Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?
>>
>> As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?
>>
>> On Dec 18, 2011, at 1:56 PM, "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org> wrote:
>>
>>>
>>>    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946 ]
>>>
>>> Dmitriy Lyubimov commented on MAHOUT-923:
>>> -----------------------------------------
>>>
>>> Raphael, thank you for seeing this thru.
>>>
>>> Q:
>>> 1) -- why do you need a vector class for the accumulator now? The mean is kind of expected to be dense in the end, if not in the mappers then at least in the reducer for sure. And secondly, if you want to do this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file format APIs, which take classes, not strings.
>>>
>>> 2) -- I know you have a unit test, but did you test it on a simulated input, like say 2G big? If not, I will have to test it before you proceed.
>>>
>>> As a next step, I guess I need to try it out to see if it works on various kinds of inputs.
>>>
>>>> Row mean job for PCA
>>>> --------------------
>>>>
>>>>                Key: MAHOUT-923
>>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-923
>>>>            Project: Mahout
>>>>         Issue Type: Improvement
>>>>         Components: Math
>>>>   Affects Versions: 0.6
>>>>           Reporter: Raphael Cendrillon
>>>>           Assignee: Raphael Cendrillon
>>>>            Fix For: Backlog
>>>>
>>>>        Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>>>>
>>>>
>>>> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>>
>>>

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
If I had to guess, the mapper-reported time should be under 1 minute
regardless of the input size on any non-VM machine (unless it is an
IBM XT :) even with -Xmx200m, which is the Hadoop default.

The reducer depends on the input size, but unless you manage to
generate 1000 mappers, I don't think it will go over 1 minute either.

Thanks.
-Dmitriy

On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon
<ce...@gmail.com> wrote:
> Thanks Dmitry. I tend to agree. Let's pull out the generic and just set it dense.
>
> Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?
>
> As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?
>

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Raphael Cendrillon <ce...@gmail.com>.
Sorry, I should have been clearer on this. By 'brute force' I mean mean subtraction from Y, not A.

Dmitriy, from what I can gather from your document this is still necessary, even with mean propagation. Is that right?

On 18 Dec, 2011, at 11:19 PM, Ted Dunning wrote:

> No.  That way lies madness.  That makes sparse rows non-sparse.
> 
> Such subtraction must be done implicitly, not explicitly.
> 
> On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon <
> cendrillon1978@gmail.com> wrote:
> 
>> As a follow up, would it be worth starting work on the 'brute force' job
>> for subtracting the average from each of the rows?
>> 


Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Ted Dunning <te...@gmail.com>.
No.  That way lies madness.  That makes sparse rows non-sparse.

Such subtraction must be done implicitly, not explicitly.

On Sun, Dec 18, 2011 at 2:04 PM, Raphael Cendrillon <
cendrillon1978@gmail.com> wrote:

> As a follow up, would it be worth starting work on the 'brute force' job
> for subtracting the average from each of the rows?
>
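Ted's point is that explicitly subtracting the dense mean from each sparse row fills in every zero and destroys sparsity. The subtraction can instead be carried implicitly through whatever computation needs it. As a minimal sketch (not Mahout code; the class and method names here are made up for illustration), a dot product against a mean-centered row can be computed as (x . v) - (m . v), touching only the non-zeros of x:

```java
import java.util.HashMap;
import java.util.Map;

// Why mean subtraction should stay implicit: for a sparse row x and a dense
// mean m, (x - m) . v equals (x . v) - (m . v), so the dense vector x - m
// never needs to be materialized. Purely illustrative, not Mahout code.
public final class ImplicitMeanCentering {

  // x is sparse: a map from column index to value. m and v are dense.
  static double centeredDot(Map<Integer, Double> x, double[] m, double[] v) {
    double xv = 0.0;
    for (Map.Entry<Integer, Double> e : x.entrySet()) {  // touch only non-zeros
      xv += e.getValue() * v[e.getKey()];
    }
    double mv = 0.0;
    for (int j = 0; j < m.length; j++) {
      mv += m[j] * v[j];
    }
    return xv - mv;  // equals (x - m) . v
  }

  public static void main(String[] args) {
    Map<Integer, Double> x = new HashMap<>();
    x.put(1, 2.0);                             // x = [0, 2, 0]
    double[] m = {1.0, 1.0, 1.0};
    double[] v = {1.0, 2.0, 3.0};
    System.out.println(centeredDot(x, m, v));  // (x - m) = [-1, 1, -1]; dot = -2.0
  }
}
```

The m . v term is computed once per query vector, so the per-row work stays proportional to the number of non-zeros.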

Re: [jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by Raphael Cendrillon <ce...@gmail.com>.
Thanks, Dmitriy. I tend to agree. Let's pull out the generic and just set it dense.

Let me try out some larger data sets and see how it runs. Do you have any suggestions / expectations on performance that I should aim for? E.g. Given x nodes and a y by y matrix the job should take around z minutes?

As a follow up, would it be worth starting work on the 'brute force' job for subtracting the average from each of the rows?


[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Dmitriy Lyubimov (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171946#comment-13171946 ] 

Dmitriy Lyubimov commented on MAHOUT-923:
-----------------------------------------

Raphael, thank you for seeing this through.

Q:
1) -- Why do you need a vector class for the accumulator now? The mean is expected to be dense in the end, if not in the mappers then at least in the reducer. And secondly, if you do want this, why doesn't your API accept a class instance rather than a "short" name? That would be consistent with the Hadoop Job and file format APIs, which take classes, not strings.

2) -- I know you have a unit test, but did you test it on a simulated input, say 2 GB? If not, I will have to test it before you proceed.

As a next step, I guess I need to try it out to see whether it works on various kinds of inputs.
                
> Row mean job for PCA
> --------------------
>
>                 Key: MAHOUT-923
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-923
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Math
>    Affects Versions: 0.6
>            Reporter: Raphael Cendrillon
>            Assignee: Raphael Cendrillon
>             Fix For: Backlog
>
>         Attachments: MAHOUT-923.patch, MAHOUT-923.patch, MAHOUT-923.patch
>
>
> Add map reduce job for calculating mean row (column-wise mean) of a Distributed Row Matrix for use in PCA.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
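The accumulation scheme discussed in this thread (each mapper emits a partial column-wise sum plus a row count; the reducer combines them into one dense mean) can be sketched outside Hadoop as plain Java. `PartialSum`, `accumulate`, and `mean` are illustrative names, not Mahout APIs:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the (row count, column-wise sum) accumulation behind the row-mean
// job. Each mapper would emit one PartialSum for its split; the reducer
// combines them and divides by the total row count.
public final class RowMeanSketch {

  static final class PartialSum {
    final long rows;     // number of rows seen
    final double[] sum;  // column-wise sum over those rows

    PartialSum(long rows, double[] sum) {
      this.rows = rows;
      this.sum = sum;
    }
  }

  // What a mapper would do with its slice of the matrix.
  static PartialSum accumulate(double[][] slice) {
    int cols = slice[0].length;
    double[] sum = new double[cols];
    for (double[] row : slice) {
      for (int j = 0; j < cols; j++) {
        sum[j] += row[j];
      }
    }
    return new PartialSum(slice.length, sum);
  }

  // What the single reducer would do with all partial sums.
  static double[] mean(List<PartialSum> parts) {
    int cols = parts.get(0).sum.length;
    double[] total = new double[cols];
    long rows = 0;
    for (PartialSum p : parts) {
      rows += p.rows;
      for (int j = 0; j < cols; j++) {
        total[j] += p.sum[j];
      }
    }
    for (int j = 0; j < cols; j++) {
      total[j] /= rows;
    }
    return total;
  }

  public static void main(String[] args) {
    PartialSum a = accumulate(new double[][] {{1, 2}, {3, 4}});
    PartialSum b = accumulate(new double[][] {{5, 6}});
    // Column means of {1, 3, 5} and {2, 4, 6}:
    System.out.println(Arrays.toString(mean(Arrays.asList(a, b))));  // [3.0, 4.0]
  }
}
```

Because the partial sums are associative, the same `mean` combine step works whether it runs in a combiner, a reducer, or front-end post-processing of map outputs.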

        

[jira] [Closed] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon closed MAHOUT-923.
-------------------------------------


Marking this as closed as the patch has been integrated into MAHOUT-817.
                

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168392#comment-13168392 ] 

Ted Dunning commented on MAHOUT-923:
------------------------------------

Sean,

Clone can only return *exactly* the same type.  This is a real problem sometimes.  For example, view matrices should not return view matrices but should return something of the type of the underlying matrix, but in the right size.

The issue is analogous to the problem with constructors compared to factory methods.  With constructors, you have already defined the return type and you may not know enough to really choose the correct return type.  With factory methods, the framework is free to give you anything that satisfies the basic contract.
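The covariance problem Ted describes can be shown in miniature. `like()` echoes Mahout's factory-method naming, but the classes below are made up for illustration; `clone()` is bound by contract to return the receiver's own class, while a factory method is free to pick the concrete result type:

```java
// clone() must return the receiver's exact class, so cloning a view yields
// another view. A factory method such as like() can delegate, so a view
// produces something of the underlying matrix's type at the requested size.
// Class names are illustrative, not actual Mahout classes.
abstract class AbstractMatrix {
  final int rows;
  final int cols;

  AbstractMatrix(int rows, int cols) {
    this.rows = rows;
    this.cols = cols;
  }

  // Factory method: subclasses choose a sensible concrete result type.
  abstract AbstractMatrix like(int rows, int cols);
}

final class DenseMatrix extends AbstractMatrix {
  DenseMatrix(int rows, int cols) {
    super(rows, cols);
  }

  @Override
  AbstractMatrix like(int rows, int cols) {
    return new DenseMatrix(rows, cols);
  }
}

// A view over another matrix: its factory delegates to the underlying type,
// so operations on a view never stack up more views.
final class MatrixView extends AbstractMatrix {
  final AbstractMatrix underlying;

  MatrixView(AbstractMatrix underlying, int rows, int cols) {
    super(rows, cols);
    this.underlying = underlying;
  }

  @Override
  AbstractMatrix like(int rows, int cols) {
    return underlying.like(rows, cols);  // not a MatrixView
  }
}

public final class CloneVsFactory {
  public static void main(String[] args) {
    AbstractMatrix view = new MatrixView(new DenseMatrix(10, 10), 3, 3);
    System.out.println(view.like(3, 3).getClass().getSimpleName());  // DenseMatrix
  }
}
```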

                

[jira] [Updated] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raphael Cendrillon updated MAHOUT-923:
--------------------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168479#comment-13168479 ] 

Ted Dunning commented on MAHOUT-923:
------------------------------------

For getting rid of trailing white space, most IDEs have this function built in.

What are you using to write this?
                

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167967#comment-13167967 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-13 00:10:57.848590)


Review request for mahout and Dmitriy Lyubimov.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class, IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "Raphael Cendrillon (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167341#comment-13167341 ] 

Raphael Cendrillon commented on MAHOUT-923:
-------------------------------------------

Thanks Lance. A combiner is definitely the next step. One question, is there already a writable for tuples of e.g. int and Vector, or should I just write one from scratch? I know there is TupleWritable, but from what I've read online it's better to avoid that unless you're doing a multiple input join.

Regarding the class for the output vector, are you saying that instead of inheriting the class from the rows of the DistributedRowMatrix you'd rather be able to specify this manually?
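A from-scratch tuple writable of the kind discussed here follows the standard Hadoop Writable pattern: `write(DataOutput)` and `readFields(DataInput)`. The sketch below stands alone on `java.io` with a plain `double[]` in place of a Mahout Vector; a real implementation would implement `org.apache.hadoop.io.Writable` and serialize the vector via `VectorWritable`. The class name is taken from the thread but everything else is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Shape of an IntVectorTupleWritable: an int row count plus a column-sum
// vector, serialized in a fixed order so readFields can mirror write.
public final class IntVectorTuple {
  int count;
  double[] sum;

  IntVectorTuple() {}

  IntVectorTuple(int count, double[] sum) {
    this.count = count;
    this.sum = sum;
  }

  void write(DataOutput out) throws IOException {
    out.writeInt(count);
    out.writeInt(sum.length);  // length prefix so the reader can size the array
    for (double d : sum) {
      out.writeDouble(d);
    }
  }

  void readFields(DataInput in) throws IOException {
    count = in.readInt();
    sum = new double[in.readInt()];
    for (int j = 0; j < sum.length; j++) {
      sum[j] = in.readDouble();
    }
  }

  public static void main(String[] args) throws IOException {
    // Round-trip through a byte buffer, as the shuffle would do.
    IntVectorTuple t = new IntVectorTuple(3, new double[] {9.0, 12.0});
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    t.write(new DataOutputStream(bytes));

    IntVectorTuple back = new IntVectorTuple();
    back.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    System.out.println(back.count + " rows, sum[0]=" + back.sum[0]);  // 3 rows, sum[0]=9.0
  }
}
```

With such a tuple as the map output value, a combiner can add counts and sums element-wise before the reducer divides.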


                

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13168134#comment-13168134 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-13 04:46:47.630950)


Review request for mahout, lancenorskog and Dmitriy Lyubimov.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class, IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213474 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213474 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                

[jira] [Commented] (MAHOUT-923) Row mean job for PCA

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167440#comment-13167440 ] 

jiraposter@reviews.apache.org commented on MAHOUT-923:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3147/
-----------------------------------------------------------

(Updated 2011-12-12 10:41:46.013180)


Review request for mahout.


Summary
-------

Here's a patch with a simple job to calculate the row mean (column-wise mean). One outstanding issue is the combiner; this requires a writable class, IntVectorTupleWritable, where the Int stores the number of rows and the Vector stores the column-wise sum.


This addresses bug MAHOUT-923.
    https://issues.apache.org/jira/browse/MAHOUT-923


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/DistributedRowMatrix.java 1213095 
  /trunk/core/src/main/java/org/apache/mahout/math/hadoop/MatrixRowMeanJob.java PRE-CREATION 
  /trunk/core/src/test/java/org/apache/mahout/math/hadoop/TestDistributedRowMatrix.java 1213095 

Diff: https://reviews.apache.org/r/3147/diff


Testing
-------

Junit test


Thanks,

Raphael


                