Posted to dev@mahout.apache.org by Yuri Niyazov <yu...@gmail.com> on 2008/05/30 20:05:38 UTC

Question from newbie to mahout

Hi everyone,

  I reviewed the mailing list archives, and the Stanford NIPS paper. It is
unclear which algorithms have already been "claimed" for development, so to
speak, by the selected GSoC participants. I am not part of GSoC, and I would
like to write and contribute at least one full-blown (both uniprocess & MR)
algorithm implementation to Mahout, but I don't want to write something that
someone else has plans to do in the near future. A summary or an appropriate
pointer to such a summary would be appreciated.

Thanks,
YN.

Re: Question from newbie to mahout

Posted by Ted Dunning <te...@gmail.com>.
On Fri, May 30, 2008 at 1:14 PM, Grant Ingersoll <gs...@apache.org>
wrote:

>
>
> a) tree classifiers, notably random forests and some sort of boosted
>> decision tree
>>
>
> I think one of the GSoC students may be working on Random Forests.  Ian?


Excellent!


> c) better parallel matrix operations
>>
>
> Note, Hama has been accepted into incubation, so that is something to keep
> an eye on and evaluate.


I am dubious of the Hama approach, based as it is on HBase.  For that level
of performance, we can just emit order-1 updates and depend on combiners to
make us whole.  For higher performance, I think we need to do much better
blocking and do high-performance multiplies on map nodes.
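The blocking idea can be sketched in plain Python (a hypothetical illustration, not Mahout code; `blocks` and `block_multiply` are made-up names): partition A and B into aligned square blocks, let each map task do a dense multiply of one block pair, and let a reducer sum the resulting partial blocks.

```python
def blocks(m, size):
    """Split a dense matrix (list of lists) into a dict {(bi, bj): block}."""
    n_rows, n_cols = len(m), len(m[0])
    out = {}
    for bi in range(0, n_rows, size):
        for bj in range(0, n_cols, size):
            out[(bi // size, bj // size)] = [row[bj:bj + size]
                                             for row in m[bi:bi + size]]
    return out

def block_multiply(a, b, size):
    """C = A B computed block-wise.

    Each matching block pair is multiplied densely (the per-map work);
    the += accumulation plays the role of the reducer summing partials.
    """
    A, B = blocks(a, size), blocks(b, size)
    n, p = len(a), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for (bi, bk), ablk in A.items():
        for (bk2, bj), bblk in B.items():
            if bk != bk2:
                continue
            for i, arow in enumerate(ablk):
                for j in range(len(bblk[0])):
                    s = sum(arow[t] * bblk[t][j] for t in range(len(bblk)))
                    c[bi * size + i][bj * size + j] += s
    return c
```

With a block size larger than 1, each map task touches a contiguous chunk of both inputs, which is where the win over one-element-at-a-time updates comes from.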

For example, the matrix product {A B}_ij = sum_k a_ik b_kj can be
implemented using map-reduce much like a bigram counter.

If A and B are encoded as triples containing (m, row, column, value) where m
is 'a' or 'b' as appropriate, then we can do roughly the following.  This
may be fast enough for very sparse matrices, but might be very slow for
dense matrices:

    map1 = {key, value, out ->
          // value is ['a', i, k, a_ik] or ['b', k, j, b_kj]
          // output record is keyed by k and has ['a', i, a_ik] or ['b', j, b_kj] as value
          // in real life, the 'a' or 'b' would be inferred in the configure method
          if (value[0] == "a") out.collect(value[2], [value[0], value[1], value[3]])
          else out.collect(value[1], [value[0], value[2], value[3]])
    }
    reducer1 = {key, values, out ->
          // key is k, values contains items like ['a', i, a_ik] or ['b', j, b_kj]
          // we build a nested map to simplify iterating over the cross product
          def v = [a: [:], b: [:]]
          values.each {
               v[it[0]][it[1]] = it[2]
          }
          v.a.each {i, a_ik ->
                v.b.each {j, b_kj ->
                     out.collect([i, j], a_ik * b_kj)
                }
          }
    }

    // second MR pass adds up all of the products by reducing on final position
    map2 = {key, ijv, out ->
        // ijv is [i, j, a_ik * b_kj] for some (now unknown) k
        out.collect(ijv[0..1], ijv[2])
    }
    combiner2 = {ij, values, out ->
        out.collect(ij, values.sum())
    }
    reducer2 = combiner2
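The two passes above can be sanity-checked with a small plain-Python simulation (hypothetical, not Mahout code), modeling each shuffle/reduce with dicts:

```python
from collections import defaultdict

def mr_multiply(triples):
    """Simulate the two-pass MR matrix multiply on (m, row, col, value)
    triples, with m in {'a', 'b'}."""
    # pass 1 map: key every a_ik and b_kj by the shared middle index k
    by_k = defaultdict(lambda: {'a': {}, 'b': {}})
    for m, r, c, v in triples:
        if m == 'a':
            by_k[c]['a'][r] = v      # a_ik keyed by k, carrying i
        else:
            by_k[r]['b'][c] = v      # b_kj keyed by k, carrying j
    # pass 1 reduce: emit one product a_ik * b_kj per (i, j) in the
    # cross product of the 'a' and 'b' groups for each k
    partials = defaultdict(list)
    for groups in by_k.values():
        for i, a_ik in groups['a'].items():
            for j, b_kj in groups['b'].items():
                partials[(i, j)].append(a_ik * b_kj)
    # pass 2: sum the partial products for each output cell (i, j)
    return {ij: sum(vals) for ij, vals in partials.items()}
```

Because only nonzero triples are ever emitted, the work in pass 1 is proportional to the number of nonzero (a_ik, b_kj) pairs sharing a k, which is why this is reasonable for very sparse matrices and painful for dense ones.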

Re: Question from newbie to mahout

Posted by Grant Ingersoll <gs...@apache.org>.
On May 30, 2008, at 3:54 PM, Ted Dunning wrote:

> There are several clustering implementations done and in process, some
> variants of naive bayesian classification are happening, somebody is
> working on logistic regression (and hopefully generalized linear
> modeling).  Another guy is doing some nice thinking about evolutionary
> algorithms.  I think somebody is working on SVM, but I haven't heard
> much about that just lately.
>
> Some things that have not been jumped on include:
>
> a) tree classifiers, notably random forests and some sort of boosted
> decision tree

I think one of the GSoC students may be working on Random Forests.  Ian?

>
>
> b) general cooccurrence analysis
>
> c) better parallel matrix operations

Note, Hama has been accepted into incubation, so that is something to  
keep an eye on and evaluate.

>
>
> d) latent variable techniques such as LDA, MDCA, non-negative matrix
> factorization and LSI (although there has been some discussion of this
> recently)
>
> As far as I know (a) and (b) are wide open.  I would expect that the
> folks working on different parts of existing efforts would welcome some
> additional umph, so I wouldn't let that stop you.

Definitely!   Demos, docs, other fun algorithms.  It's really wide  
open at this point.


>
>
> On Fri, May 30, 2008 at 11:05 AM, Yuri Niyazov  
> <yu...@gmail.com>
> wrote:
>
>> Hi everyone,
>>
>> I reviewed the mailing list archives, and the Stanford NIPS paper. It
>> is unclear which algorithms have already been "claimed" for
>> development, so to speak, by the selected GSoC participants. I am not
>> part of GSoC, and I would like to write and contribute at least one
>> full-blown (both uniprocess & MR) algorithm implementation to Mahout,
>> but I don't want to write something that someone else has plans to do
>> in the near future. A summary or an appropriate pointer to such a
>> summary would be appreciated.
>>
>> Thanks,
>> YN.
>>
>
>
>
> -- 
> ted

--------------------------
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ


Re: Question from newbie to mahout

Posted by Ted Dunning <te...@gmail.com>.
There are several clustering implementations done and in process, some
variants of naive bayesian classification are happening, somebody is working
on logistic regression (and hopefully generalized linear modeling).  Another
guy is doing some nice thinking about evolutionary algorithms.  I think
somebody is working on SVM, but I haven't heard much about that just lately.

Some things that have not been jumped on include:

a) tree classifiers, notably random forests and some sort of boosted
decision tree

b) general cooccurrence analysis

c) better parallel matrix operations

d) latent variable techniques such as LDA, MDCA, non-negative matrix
factorization and LSI (although there has been some discussion of this
recently)

As far as I know (a) and (b) are wide open.  I would expect that the folks
working on different parts of existing efforts would welcome some additional
umph, so I wouldn't let that stop you.

On Fri, May 30, 2008 at 11:05 AM, Yuri Niyazov <yu...@gmail.com>
wrote:

> Hi everyone,
>
>  I reviewed the mailing list archives, and the Stanford NIPS paper. It is
> unclear which algorithms have already been "claimed" for development, so
> to speak, by the selected GSoC participants. I am not part of GSoC, and I
> would like to write and contribute at least one full-blown (both
> uniprocess & MR) algorithm implementation to Mahout, but I don't want to
> write something that someone else has plans to do in the near future. A
> summary or an appropriate pointer to such a summary would be appreciated.
>
> Thanks,
> YN.
>



-- 
ted