Posted to dev@mahout.apache.org by Hector Yee <he...@gmail.com> on 2011/05/18 02:26:49 UTC

Possible contributions

Hello,

  Some background on myself: I spent the last five years at Google working on
machine learning for the self-driving car, image search, and YouTube
(http://www.linkedin.com/in/yeehector).

  I have some proposed contributions and I wonder whether they would be useful
in Mahout (otherwise I will just commit them to a new open source project on
GitHub).

- Sparse autoencoder (think of it as something like LDA - it has an
unsupervised hidden topic model and an output that reconstructs the input
but blurs it a bit due to the hidden-layer bottleneck). The variant I am
planning to implement is optimized for sparse (e.g. text) labels. I'm not sure
whether it will fit into the filter framework.

- Boosting with L1 regularization and back pruning (just the binary case -
I haven't had much luck with the multi-class case vs. AdaBoost ECC).

- Online kernelized learner for ranking and classification (optimization in
the primal rather than the dual)

I'm new to Mahout, so let me know if anyone is working on these already or
not. I've implemented them several times in C++.
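
For readers following the thread, here is a rough sketch of what an online
kernelized learner trained in the primal can look like. It is illustrative
only: the class name, kernel, and parameters below are assumptions, not
Hector's C++ code or anything in Mahout. The model is a growing set of
support vectors with coefficients, and each example takes one stochastic
step on the regularized hinge-loss objective.

import java.util.ArrayList;
import java.util.List;

/** Illustrative online kernel classifier trained in the primal; hypothetical, not Mahout code. */
public class OnlineKernelClassifier {

  private final List<double[]> supportVectors = new ArrayList<double[]>();
  private final List<Double> alphas = new ArrayList<Double>();
  private final double gamma;         // RBF kernel width
  private final double lambda;        // L2 regularization weight in the primal objective
  private final double learningRate;

  public OnlineKernelClassifier(double gamma, double lambda, double learningRate) {
    this.gamma = gamma;
    this.lambda = lambda;
    this.learningRate = learningRate;
  }

  private double kernel(double[] a, double[] b) {
    double distanceSquared = 0.0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      distanceSquared += diff * diff;
    }
    return Math.exp(-gamma * distanceSquared);
  }

  /** f(x) = sum_i alpha_i * K(x_i, x), expanded over the stored support vectors. */
  public double predict(double[] x) {
    double score = 0.0;
    for (int i = 0; i < supportVectors.size(); i++) {
      score += alphas.get(i) * kernel(supportVectors.get(i), x);
    }
    return score;
  }

  /** One stochastic step on the primal hinge-loss objective, label y in {-1, +1}. */
  public void train(double[] x, int y) {
    // Gradient of the regularizer: shrink all existing coefficients.
    for (int i = 0; i < alphas.size(); i++) {
      alphas.set(i, alphas.get(i) * (1.0 - learningRate * lambda));
    }
    // If the hinge loss is active, the example joins the expansion as a support vector.
    if (y * predict(x) < 1.0) {
      supportVectors.add(x.clone());
      alphas.add(learningRate * y);
    }
  }
}

Practical versions usually also bound the number of support vectors (a
"budget") so the model cannot grow without limit.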

-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Possible contributions

Posted by Ted Dunning <te...@gmail.com>.
Good man.

On Mon, May 23, 2011 at 3:45 PM, Hector Yee <he...@gmail.com> wrote:

> FYI the ICLA has been filed.
>
>

Re: Possible contributions

Posted by Hector Yee <he...@gmail.com>.
FYI the ICLA has been filed.

On Wed, May 18, 2011 at 3:27 AM, Ted Dunning <te...@gmail.com> wrote:

> Hector,
>
> An in-core variant or a sequential on-disk variant is a great starting
> point
> and focussing on the kernelized ranker is also a good place to start.
>
> It would help if you can provide lots of visibility early in the process.
>  IF the JIRA process of attaching a diff becomes cumbersome, then you can
> use something like an associated github mirror of Mahout where you keep
> your
> work in progress.  Make sure you are good with the ASL requirements and if
> you plan to write more than a few hundred lines of code, it would be good
> to
> file an individual contributor license.  That can be found here:
>
> http://www.apache.org/licenses/icla.txt
>
> On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com> wrote:
>
> > I'll probably just implement an in-core variant first.
> >
> > re: online kernelized ranker - this is pretty easy to do so I will
> probably
> > do it as a starter contribution.
> >
> > re: java, sure I have no problems writing it all in java.
> >
> > What's the process for doing this? Write the code and then start a JIRA
> > ticket
> > with the patch?
> >
>



-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Possible contributions

Posted by Hector Yee <he...@gmail.com>.
I just completed and submitted an online passive-aggressive classifier as my
test case (MAHOUT-702). I believe I've followed the how-to, except I couldn't
find a CHANGES.txt to record my changes in.
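
For context, the heart of an online passive-aggressive classifier is a
closed-form per-example update. Below is a minimal sketch of the standard
PA-I binary update - illustrative only, with hypothetical names, and not the
code attached to MAHOUT-702.

/** Sketch of the PA-I binary update; not the MAHOUT-702 implementation. */
public class PassiveAggressiveSketch {

  private final double[] weights;
  private final double aggressiveness; // the C parameter bounding each step

  public PassiveAggressiveSketch(int numFeatures, double aggressiveness) {
    this.weights = new double[numFeatures];
    this.aggressiveness = aggressiveness;
  }

  public double predict(double[] x) {
    double dot = 0.0;
    for (int i = 0; i < x.length; i++) {
      dot += weights[i] * x[i];
    }
    return dot;
  }

  /** One update for label y in {-1, +1}: move just far enough to satisfy the margin. */
  public void train(double[] x, int y) {
    double loss = Math.max(0.0, 1.0 - y * predict(x));
    if (loss == 0.0) {
      return; // passive: the margin is already satisfied
    }
    double squaredNorm = 0.0;
    for (double v : x) {
      squaredNorm += v * v;
    }
    if (squaredNorm == 0.0) {
      return;
    }
    double tau = Math.min(aggressiveness, loss / squaredNorm); // PA-I step size
    for (int i = 0; i < x.length; i++) {
      weights[i] += tau * y * x[i];
    }
  }
}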

On Wed, May 18, 2011 at 6:27 PM, Ted Dunning <te...@gmail.com> wrote:

> Hector,
>
> An in-core variant or a sequential on-disk variant is a great starting
> point
> and focussing on the kernelized ranker is also a good place to start.
>
> It would help if you can provide lots of visibility early in the process.
>  IF the JIRA process of attaching a diff becomes cumbersome, then you can
> use something like an associated github mirror of Mahout where you keep
> your
> work in progress.  Make sure you are good with the ASL requirements and if
> you plan to write more than a few hundred lines of code, it would be good
> to
> file an individual contributor license.  That can be found here:
>
> http://www.apache.org/licenses/icla.txt
>
> On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com> wrote:
>
> > I'll probably just implement an in-core variant first.
> >
> > re: online kernelized ranker - this is pretty easy to do so I will
> probably
> > do it as a starter contribution.
> >
> > re: java, sure I have no problems writing it all in java.
> >
> > What's the process for doing this? Write the code and then start a JIRA
> > ticket
> > with the patch?
> >
>



-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Possible contributions

Posted by Ted Dunning <te...@gmail.com>.
Hector,

An in-core variant or a sequential on-disk variant is a great starting point,
and focusing on the kernelized ranker is also a good place to start.

It would help if you can provide lots of visibility early in the process.
If the JIRA process of attaching a diff becomes cumbersome, then you can
use something like an associated GitHub mirror of Mahout where you keep your
work in progress.  Make sure you are good with the ASL requirements and, if
you plan to write more than a few hundred lines of code, it would be good to
file an individual contributor license agreement (ICLA).  That can be found here:

http://www.apache.org/licenses/icla.txt

On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com> wrote:

> I'll probably just implement an in-core variant first.
>
> re: online kernelized ranker - this is pretty easy to do so I will probably
> do it as a starter contribution.
>
> re: java, sure I have no problems writing it all in java.
>
> What's the process for doing this? Write the code and then start a JIRA
> ticket
> with the patch?
>

Re: Possible contributions

Posted by Shannon Quinn <sq...@gatech.edu>.
As far as I understand, the problem isn't adding multiple inputs; you 
can do it exactly as the documentation you linked shows. The problem 
(which is what we're trying to solve in MAHOUT-537) is how to tell 
within the Mapper/Reducer itself from which input path the current data 
are taken; there's no way to say for sure where the row of data you're 
currently operating on came from, which is information that is essential 
to, say, matrix-matrix multiplication.

However, if all you want to do is let your algorithm take
multiple input files and treat them more or less as "one big file", then
I don't think this approach should give you any problems.
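
If you do need to know the originating file under the new API, one common
approach - a sketch only, with hypothetical class names, and assuming a
file-based input format - is to read the path off the InputSplit in the
Mapper's setup method and fold it into the emitted key:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/** Sketch: tag each record with the input file it came from (new mapreduce API). */
public class SourceTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String sourceTag;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // With file-based input formats the split is a FileSplit, which knows its path.
    Path inputPath = ((FileSplit) context.getInputSplit()).getPath();
    sourceTag = inputPath.getName();
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Carry the source tag in the key; a real job would pair the tag with the
    // record's own join key instead of the byte offset used here.
    context.write(new Text(sourceTag + "\t" + offset.get()), line);
  }
}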

On 5/28/11 5:43 PM, Dhruv Kumar wrote:
> Isabel and Dmitry,
>
> Thank you for your input on this. I've noticed that Mahout's code uses the
> new mapreduce package, so I have been following the new APIs. This was also
> suggested by Sean w.r.t Mahout-294.
>
> Multiple inputs is a requirement for my project and I was planning on using
> the old mapred.lib.multipleinputs class which is not marked as deprecated in
> 0.20.2:
>
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>
> Is this advisable and if not, what are my options to handle multiple inputs?
>
> On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>
>> Dhruv,
>>
>> Just a warning, before you want to lock yourself to new apis:
>>
>> Yes new APIs are preferrable but it is not always possible to use them
>> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
>> realm . (multiple inputs/ outputs come to mind at once).
>>
>> I think i did weasel my way out of those in some cases but i did not
>> test it at scale yet, it is certainly not an official way to do it.
>>
>> Either way it's probably not worth it for anything beyond sheer basic
>> MR functionality until we switch to something that actually does have
>> the 'new api' because 0.20.2 has some very much truncated version
>> which is very far from complete.
>>
>> -d
>>
>> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost<is...@apache.org>  wrote:
>>> On 18.05.2011 Dhruv Kumar wrote:
>>>> For the GSoC project which version of Hadoop's API should I follow?
>>> Try to use the new M/R apis where possible - we had the same discussion
>> in an
>>> earlier thread on spectral clustering, in addition Sean just opened an
>> issue
>>> concerning Upgrading to newer Hadoop versions, you can take a look there
>> as
>>> well.
>>>
>>> Isabel
>>>


Re: Possible contributions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Yes. There's always a workaround.

Say input1 is tab-separated text with 3 attributes and input2 is a
sequence file with another 6 attributes. Yes, you could run 2 map-only jobs
on them to bring them to a homogeneous format, with a join key indicating
which part each record came from, and then dump them together and run a
final single-input job to join them.

But that's the point of a tool: you can always say you could manage to
make a fire by rubbing two sticks together instead of buying a lighter.
That doesn't mean not having a lighter is a good thing for the purposes of
camping. Nor does it prove that the lighter works as expected under the new
API settings :)

-d

On Sat, May 28, 2011 at 4:05 PM, Sean Owen <sr...@gmail.com> wrote:
> You can't mix and match old and new APIs in general, no.
>
> It's better to use new APIs unless it would make the implementation really
> hard or really slow.
>
> The new APIs lack MultipleInputs as of 0.20.x. That doesn't mean you can't
> have multiple inputs. You can add several input paths as Shannon says. What
> I don't know how to do in 0.20.x is configure a different Mapper per input
> path. (Or is it true that this is an exception where you *can* use the old
> API? if MultipleInputs is just setting some config variables in a certain
> way that the new API respects, it works.)
>
> However it is *not* true that you can't know where the data is from. This is
> the whole "join key" business I was alluding to last time. The key can
> contain this info. (The value could, but it's more useful in the key where
> it can even affect ordering.)
>
> Bottom line: without knowing much about what you're up to I am still 80%
> sure you can construct your implementation with what the new API has.
>
>
> On Sat, May 28, 2011 at 11:58 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:
>
>> As i said, and as i think Shannon's reply confirms in part, you
>> sometimes can weasel your way out of this, but this is not how this
>> api is intended to be used. To begin with, old and new api have never
>> been intended to be used together (so you are already breaking interop
>> guarantees with any of future releases), and, second, by using old api
>> in part.... you are not using new anyway by definition of your
>> actions.
>>
>>
>

Re: Possible contributions

Posted by Sean Owen <sr...@gmail.com>.
You can't mix and match old and new APIs in general, no.

It's better to use new APIs unless it would make the implementation really
hard or really slow.

The new APIs lack MultipleInputs as of 0.20.x. That doesn't mean you can't
have multiple inputs. You can add several input paths as Shannon says. What
I don't know how to do in 0.20.x is configure a different Mapper per input
path. (Or is it true that this is an exception where you *can* use the old
API? If MultipleInputs is just setting some config variables in a certain
way that the new API respects, it works.)

However, it is *not* true that you can't know where the data is from. This is
the whole "join key" business I was alluding to last time. The key can
contain this info. (The value could, but it's more useful in the key, where
it can even affect ordering.)

Bottom line: without knowing much about what you're up to, I am still 80%
sure you can construct your implementation with what the new API has.
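
A minimal sketch of that "join key" idea, with hypothetical types (this is
not a prescribed Mahout pattern): the map output key carries a source tag
next to the real key, so the reducer knows which relation each value came
from, and the tag can even drive ordering within a key.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/** Sketch of a composite join key: the real key plus a tag naming its source input. */
public class TaggedKey implements WritableComparable<TaggedKey> {

  private final Text key = new Text();
  private final Text sourceTag = new Text(); // e.g. "matrixA" or "matrixB"

  public void set(String realKey, String tag) {
    key.set(realKey);
    sourceTag.set(tag);
  }

  public String getSourceTag() {
    return sourceTag.toString();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    key.write(out);
    sourceTag.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    key.readFields(in);
    sourceTag.readFields(in);
  }

  @Override
  public int compareTo(TaggedKey other) {
    // Sort by the real key first; the tag then fixes the order of the two sides.
    int byKey = key.compareTo(other.key);
    return byKey != 0 ? byKey : sourceTag.compareTo(other.sourceTag);
  }

  @Override
  public int hashCode() {
    return key.hashCode(); // partition on the real key so both sides meet in one reducer
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof TaggedKey && compareTo((TaggedKey) o) == 0;
  }
}

In a full join you would also set a grouping comparator (and, if needed, a
custom Partitioner) over just the real key so both sides of a key land in the
same reduce call; the point here is only that the key can carry the provenance.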


On Sat, May 28, 2011 at 11:58 PM, Dmitriy Lyubimov <dl...@gmail.com>wrote:

> As i said, and as i think Shannon's reply confirms in part, you
> sometimes can weasel your way out of this, but this is not how this
> api is intended to be used. To begin with, old and new api have never
> been intended to be used together (so you are already breaking interop
> guarantees with any of future releases), and, second, by using old api
> in part.... you are not using new anyway by definition of your
> actions.
>
>

Re: Possible contributions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
As I said, and as I think Shannon's reply confirms in part, you can
sometimes weasel your way out of this, but this is not how this
API is intended to be used. To begin with, the old and new APIs have never
been intended to be used together (so you are already breaking interop
guarantees with any future releases), and, second, by using the old API
in part... you are not using the new one anyway, by definition of your
actions.

On Sat, May 28, 2011 at 3:54 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
> I don't see how you can use deprecated multiple inputs, as if i am not
> missing anything, its signature is tied to old api types, such as
> JobConf, which you of course won't have as you define a new api job.
>
> On Sat, May 28, 2011 at 3:43 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
>> Isabel and Dmitry,
>>
>> Thank you for your input on this. I've noticed that Mahout's code uses the
>> new mapreduce package, so I have been following the new APIs. This was also
>> suggested by Sean w.r.t Mahout-294.
>>
>> Multiple inputs is a requirement for my project and I was planning on using
>> the old mapred.lib.multipleinputs class which is not marked as deprecated in
>> 0.20.2:
>>
>>
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>>
>> Is this advisable and if not, what are my options to handle multiple inputs?
>>
>> On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>>
>>> Dhruv,
>>>
>>> Just a warning, before you want to lock yourself to new apis:
>>>
>>> Yes new APIs are preferrable but it is not always possible to use them
>>> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
>>> realm . (multiple inputs/ outputs come to mind at once).
>>>
>>> I think i did weasel my way out of those in some cases but i did not
>>> test it at scale yet, it is certainly not an official way to do it.
>>>
>>> Either way it's probably not worth it for anything beyond sheer basic
>>> MR functionality until we switch to something that actually does have
>>> the 'new api' because 0.20.2 has some very much truncated version
>>> which is very far from complete.
>>>
>>> -d
>>>
>>> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <is...@apache.org> wrote:
>>> > On 18.05.2011 Dhruv Kumar wrote:
>>> >> For the GSoC project which version of Hadoop's API should I follow?
>>> >
>>> > Try to use the new M/R apis where possible - we had the same discussion
>>> in an
>>> > earlier thread on spectral clustering, in addition Sean just opened an
>>> issue
>>> > concerning Upgrading to newer Hadoop versions, you can take a look there
>>> as
>>> > well.
>>> >
>>> > Isabel
>>> >
>>>
>>
>

Re: Possible contributions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
A job's input path can always be multiple paths; you don't need
MultipleInputs to specify that. What you need MultipleInputs for is
to be able to specify different input file formats and assign
different mappers to handle them.

If all your input is formatted homogeneously, both record-structure-wise
and logic-wise, then you don't need MultipleInputs. That's not
what MultipleInputs' main achievement is.
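
Put differently, when the inputs share a format and a mapper, the stock
new-API driver already handles several paths. A sketch, with hypothetical
paths and the default identity Mapper standing in for a real one:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Sketch: several input paths, one format, one mapper - no MultipleInputs needed. */
public class MultiPathDriverSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "homogeneous-multi-path-sketch");
    job.setJarByClass(MultiPathDriverSketch.class);

    // Hypothetical paths; every path is read with the same (default) TextInputFormat
    // and handled by the same (here, default identity) Mapper.
    FileInputFormat.addInputPath(job, new Path("/data/part-one"));
    FileInputFormat.addInputPath(job, new Path("/data/part-two"));
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));

    // With the default TextInputFormat, the identity mapper emits these types unchanged.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0); // map-only pass-through, for the sake of the sketch

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

MultipleInputs only earns its keep when the formats or the mappers differ
per path.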

On Sat, May 28, 2011 at 3:58 PM, Shannon Quinn <sq...@gatech.edu> wrote:
> Isn't this just a matter of making multiple calls to
> FileInputFormat.addInputPath(...) (to adhere to the new APIs) ?
>
> On 5/28/11 5:54 PM, Dmitriy Lyubimov wrote:
>>
>> I don't see how you can use deprecated multiple inputs, as if i am not
>> missing anything, its signature is tied to old api types, such as
>> JobConf, which you of course won't have as you define a new api job.
>>
>> On Sat, May 28, 2011 at 3:43 PM, Dhruv Kumar<dk...@ecs.umass.edu>  wrote:
>>>
>>> Isabel and Dmitry,
>>>
>>> Thank you for your input on this. I've noticed that Mahout's code uses
>>> the
>>> new mapreduce package, so I have been following the new APIs. This was
>>> also
>>> suggested by Sean w.r.t Mahout-294.
>>>
>>> Multiple inputs is a requirement for my project and I was planning on
>>> using
>>> the old mapred.lib.multipleinputs class which is not marked as deprecated
>>> in
>>> 0.20.2:
>>>
>>>
>>>
>>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>>>
>>> Is this advisable and if not, what are my options to handle multiple
>>> inputs?
>>>
>>> On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov<dl...@gmail.com>
>>>  wrote:
>>>
>>>> Dhruv,
>>>>
>>>> Just a warning, before you want to lock yourself to new apis:
>>>>
>>>> Yes new APIs are preferrable but it is not always possible to use them
>>>> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
>>>> realm . (multiple inputs/ outputs come to mind at once).
>>>>
>>>> I think i did weasel my way out of those in some cases but i did not
>>>> test it at scale yet, it is certainly not an official way to do it.
>>>>
>>>> Either way it's probably not worth it for anything beyond sheer basic
>>>> MR functionality until we switch to something that actually does have
>>>> the 'new api' because 0.20.2 has some very much truncated version
>>>> which is very far from complete.
>>>>
>>>> -d
>>>>
>>>> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost<is...@apache.org>  wrote:
>>>>>
>>>>> On 18.05.2011 Dhruv Kumar wrote:
>>>>>>
>>>>>> For the GSoC project which version of Hadoop's API should I follow?
>>>>>
>>>>> Try to use the new M/R apis where possible - we had the same discussion
>>>>
>>>> in an
>>>>>
>>>>> earlier thread on spectral clustering, in addition Sean just opened an
>>>>
>>>> issue
>>>>>
>>>>> concerning Upgrading to newer Hadoop versions, you can take a look
>>>>> there
>>>>
>>>> as
>>>>>
>>>>> well.
>>>>>
>>>>> Isabel
>>>>>
>
>

Re: Possible contributions

Posted by Shannon Quinn <sq...@gatech.edu>.
Isn't this just a matter of making multiple calls to 
FileInputFormat.addInputPath(...) (to adhere to the new APIs) ?

On 5/28/11 5:54 PM, Dmitriy Lyubimov wrote:
> I don't see how you can use deprecated multiple inputs, as if i am not
> missing anything, its signature is tied to old api types, such as
> JobConf, which you of course won't have as you define a new api job.
>
> On Sat, May 28, 2011 at 3:43 PM, Dhruv Kumar<dk...@ecs.umass.edu>  wrote:
>> Isabel and Dmitry,
>>
>> Thank you for your input on this. I've noticed that Mahout's code uses the
>> new mapreduce package, so I have been following the new APIs. This was also
>> suggested by Sean w.r.t Mahout-294.
>>
>> Multiple inputs is a requirement for my project and I was planning on using
>> the old mapred.lib.multipleinputs class which is not marked as deprecated in
>> 0.20.2:
>>
>>
>> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>>
>> Is this advisable and if not, what are my options to handle multiple inputs?
>>
>> On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov<dl...@gmail.com>  wrote:
>>
>>> Dhruv,
>>>
>>> Just a warning, before you want to lock yourself to new apis:
>>>
>>> Yes new APIs are preferrable but it is not always possible to use them
>>> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
>>> realm . (multiple inputs/ outputs come to mind at once).
>>>
>>> I think i did weasel my way out of those in some cases but i did not
>>> test it at scale yet, it is certainly not an official way to do it.
>>>
>>> Either way it's probably not worth it for anything beyond sheer basic
>>> MR functionality until we switch to something that actually does have
>>> the 'new api' because 0.20.2 has some very much truncated version
>>> which is very far from complete.
>>>
>>> -d
>>>
>>> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost<is...@apache.org>  wrote:
>>>> On 18.05.2011 Dhruv Kumar wrote:
>>>>> For the GSoC project which version of Hadoop's API should I follow?
>>>> Try to use the new M/R apis where possible - we had the same discussion
>>> in an
>>>> earlier thread on spectral clustering, in addition Sean just opened an
>>> issue
>>>> concerning Upgrading to newer Hadoop versions, you can take a look there
>>> as
>>>> well.
>>>>
>>>> Isabel
>>>>


Re: Possible contributions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
I don't see how you can use the deprecated MultipleInputs: if I am not
missing anything, its signature is tied to old API types, such as
JobConf, which you of course won't have when you define a new-API job.

On Sat, May 28, 2011 at 3:43 PM, Dhruv Kumar <dk...@ecs.umass.edu> wrote:
> Isabel and Dmitry,
>
> Thank you for your input on this. I've noticed that Mahout's code uses the
> new mapreduce package, so I have been following the new APIs. This was also
> suggested by Sean w.r.t Mahout-294.
>
> Multiple inputs is a requirement for my project and I was planning on using
> the old mapred.lib.multipleinputs class which is not marked as deprecated in
> 0.20.2:
>
>
> http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html
>
> Is this advisable and if not, what are my options to handle multiple inputs?
>
> On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> Dhruv,
>>
>> Just a warning, before you want to lock yourself to new apis:
>>
>> Yes new APIs are preferrable but it is not always possible to use them
>> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
>> realm . (multiple inputs/ outputs come to mind at once).
>>
>> I think i did weasel my way out of those in some cases but i did not
>> test it at scale yet, it is certainly not an official way to do it.
>>
>> Either way it's probably not worth it for anything beyond sheer basic
>> MR functionality until we switch to something that actually does have
>> the 'new api' because 0.20.2 has some very much truncated version
>> which is very far from complete.
>>
>> -d
>>
>> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <is...@apache.org> wrote:
>> > On 18.05.2011 Dhruv Kumar wrote:
>> >> For the GSoC project which version of Hadoop's API should I follow?
>> >
>> > Try to use the new M/R apis where possible - we had the same discussion
>> in an
>> > earlier thread on spectral clustering, in addition Sean just opened an
>> issue
>> > concerning Upgrading to newer Hadoop versions, you can take a look there
>> as
>> > well.
>> >
>> > Isabel
>> >
>>
>

Re: Possible contributions

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
Isabel and Dmitriy,

Thank you for your input on this. I've noticed that Mahout's code uses the
new mapreduce package, so I have been following the new APIs. This was also
suggested by Sean w.r.t. MAHOUT-294.

Multiple inputs is a requirement for my project, and I was planning on using
the old mapred.lib.MultipleInputs class, which is not marked as deprecated in
0.20.2:


http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleInputs.html

Is this advisable and if not, what are my options to handle multiple inputs?

On Sat, May 28, 2011 at 5:59 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> Dhruv,
>
> Just a warning, before you want to lock yourself to new apis:
>
> Yes new APIs are preferrable but it is not always possible to use them
> because 0.20.2 lacks _a lot_ in terms of bare necessities in new api
> realm . (multiple inputs/ outputs come to mind at once).
>
> I think i did weasel my way out of those in some cases but i did not
> test it at scale yet, it is certainly not an official way to do it.
>
> Either way it's probably not worth it for anything beyond sheer basic
> MR functionality until we switch to something that actually does have
> the 'new api' because 0.20.2 has some very much truncated version
> which is very far from complete.
>
> -d
>
> On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <is...@apache.org> wrote:
> > On 18.05.2011 Dhruv Kumar wrote:
> >> For the GSoC project which version of Hadoop's API should I follow?
> >
> > Try to use the new M/R apis where possible - we had the same discussion
> in an
> > earlier thread on spectral clustering, in addition Sean just opened an
> issue
> > concerning Upgrading to newer Hadoop versions, you can take a look there
> as
> > well.
> >
> > Isabel
> >
>

Re: Possible contributions

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
Dhruv,

Just a warning, before you lock yourself to the new APIs:

Yes, the new APIs are preferable, but it is not always possible to use them,
because 0.20.2 lacks _a lot_ in terms of bare necessities in the new-API
realm (multiple inputs/outputs come to mind at once).

I think I did weasel my way out of those in some cases, but I have not
tested it at scale yet, and it is certainly not an official way to do it.

Either way, it's probably not worth it for anything beyond sheer basic
MR functionality until we switch to something that actually does have
the 'new API', because 0.20.2 has a very much truncated version
which is very far from complete.

-d

On Fri, May 27, 2011 at 3:19 AM, Isabel Drost <is...@apache.org> wrote:
> On 18.05.2011 Dhruv Kumar wrote:
>> For the GSoC project which version of Hadoop's API should I follow?
>
> Try to use the new M/R apis where possible - we had the same discussion in an
> earlier thread on spectral clustering, in addition Sean just opened an issue
> concerning Upgrading to newer Hadoop versions, you can take a look there as
> well.
>
> Isabel
>

Re: Possible contributions

Posted by Isabel Drost <is...@apache.org>.
On 18.05.2011 Dhruv Kumar wrote:
> For the GSoC project which version of Hadoop's API should I follow?

Try to use the new M/R APIs where possible - we had the same discussion in an
earlier thread on spectral clustering. In addition, Sean just opened an issue
about upgrading to newer Hadoop versions; you can take a look there as
well.

Isabel

Re: Possible contributions

Posted by Ted Dunning <te...@gmail.com>.
Well, this much I think is uncontroversial.

On Wed, May 18, 2011 at 3:38 AM, Sean Owen <sr...@gmail.com> wrote:

> And I do think we need to focus on cleanup now rather than later. For
> example I will shortly suggest deprecating M/R jobs that use Hadoop 0.19
> APIs in the name of moving forward.
>

Re: Possible contributions

Posted by Dhruv Kumar <dk...@ecs.umass.edu>.
On Wed, May 18, 2011 at 6:38 AM, Sean Owen <sr...@gmail.com> wrote:

> I think it first has to finish embracing MapReduce! The code base already
> uses 2.5 different versions of Hadoop. It would be better clean up the
> modest clutter of approaches we already have before thinking about
> extending
> it.
>


For the GSoC project, which version of Hadoop's API should I follow?



> Good news is there's a fair bit of time before any other particular
> framework becomes widely used enough to merit thinking hard about.
>
> And I do think we need to focus on cleanup now rather than later. For
> example I will shortly suggest deprecating M/R jobs that use Hadoop 0.19
> APIs in the name of moving forward.
>
> On Wed, May 18, 2011 at 11:23 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > This is a theme that is going to raise itself over and over.
> >
> > I think that strategically, Mahout is going to have to embrace the
> > MapReduce
> > nextGen work so that we can have flexible computation models.  We already
> > need this with all the large scale SVD work.  We could very much use it
> for
> > the SGD stuff.  Now this gradient work could use it.
> >
> > New needs aren't going to stop.
> >
> > On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com>
> wrote:
> >
> > > Re: boosting scalability, I've implemented it on thousands of machines,
> > but
> > > not with mapreduce, rather with direct RPC calls. The gradient
> > computation
> > > tends to be iterative, so one way to do it is to have each iteration
> run
> > > per
> > > mapreduce.
> > > Compute gradients in the mapper, gather them in the reducer, rinse and
> > > repeat.
> > >
> >
>

Re: Possible contributions

Posted by Sean Owen <sr...@gmail.com>.
I think it first has to finish embracing MapReduce! The code base already
uses 2.5 different versions of Hadoop. It would be better to clean up the
modest clutter of approaches we already have before thinking about extending
it.

The good news is that there's a fair bit of time before any other particular
framework becomes widely used enough to merit thinking hard about.

And I do think we need to focus on cleanup now rather than later. For
example I will shortly suggest deprecating M/R jobs that use Hadoop 0.19
APIs in the name of moving forward.

On Wed, May 18, 2011 at 11:23 AM, Ted Dunning <te...@gmail.com> wrote:

> This is a theme that is going to raise itself over and over.
>
> I think that strategically, Mahout is going to have to embrace the
> MapReduce
> nextGen work so that we can have flexible computation models.  We already
> need this with all the large scale SVD work.  We could very much use it for
> the SGD stuff.  Now this gradient work could use it.
>
> New needs aren't going to stop.
>
> On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com> wrote:
>
> > Re: boosting scalability, I've implemented it on thousands of machines,
> but
> > not with mapreduce, rather with direct RPC calls. The gradient
> computation
> > tends to be iterative, so one way to do it is to have each iteration run
> > per
> > mapreduce.
> > Compute gradients in the mapper, gather them in the reducer, rinse and
> > repeat.
> >
>

Re: Possible contributions

Posted by Ted Dunning <te...@gmail.com>.
This is a theme that is going to raise itself over and over.

I think that strategically, Mahout is going to have to embrace the MapReduce
nextGen work so that we can have flexible computation models.  We already
need this with all the large scale SVD work.  We could very much use it for
the SGD stuff.  Now this gradient work could use it.

New needs aren't going to stop.

On Tue, May 17, 2011 at 10:17 PM, Hector Yee <he...@gmail.com> wrote:

> Re: boosting scalability, I've implemented it on thousands of machines, but
> not with mapreduce, rather with direct RPC calls. The gradient computation
> tends to be iterative, so one way to do it is to have each iteration run
> per
> mapreduce.
> Compute gradients in the mapper, gather them in the reducer, rinse and
> repeat.
>

Re: Possible contributions

Posted by Grant Ingersoll <gs...@apache.org>.
https://cwiki.apache.org/confluence/display/MAHOUT/How+To+Contribute

On May 18, 2011, at 1:17 AM, Hector Yee wrote:

> Re: boosting scalability, I've implemented it on thousands of machines, but
> not with mapreduce, rather with direct RPC calls. The gradient computation
> tends to be iterative, so one way to do it is to have each iteration run per
> mapreduce.
> Compute gradients in the mapper, gather them in the reducer, rinse and
> repeat.
> 
> I'll probably just implement an in-core variant first.
> 
> re: online kernelized ranker - this is pretty easy to do so I will probably
> do it as a starter contribution.
> 
> re: java, sure I have no problems writing it all in java.
> 
> What's the process for doing this? Write the code and then start a JIRA ticket
> with the patch?
> 
> -- 
> Yee Yang Li Hector
> http://hectorgon.blogspot.com/ (tech + travel)
> http://hectorgon.com (book reviews)

--------------------------
Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org


Re: Possible contributions

Posted by Hector Yee <he...@gmail.com>.
Re: boosting scalability, I've implemented it on thousands of machines, but
not with MapReduce - rather with direct RPC calls. The gradient computation
tends to be iterative, so one way to do it is to have each iteration run as
its own MapReduce job: compute gradients in the mapper, gather them in the
reducer, rinse and repeat.
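
A skeletal view of that per-iteration pattern, illustrative only: the record
format, class names, and the placeholder gradient below are assumptions, and
a real job would load the previous iteration's model in setup() (for example
from the DistributedCache) instead of assuming a zero model.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch: one gradient iteration as a MapReduce pass; the driver loops over iterations. */
public class GradientIterationSketch {

  /** Emits (featureIndex, partialGradient) for each example, against the current model. */
  public static class GradientMapper
      extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical record format: label <TAB> index:value index:value ...
      String[] fields = line.toString().split("\t");
      double label = Double.parseDouble(fields[0]);
      for (String feature : fields[1].split(" ")) {
        String[] indexAndValue = feature.split(":");
        int index = Integer.parseInt(indexAndValue[0]);
        double value = Double.parseDouble(indexAndValue[1]);
        // Placeholder: gradient of a squared loss at a zero model; a real mapper would
        // score the example with the model shipped from the previous iteration.
        double partialGradient = -label * value;
        context.write(new IntWritable(index), new DoubleWritable(partialGradient));
      }
    }
  }

  /** Sums the partial gradients per feature; the driver applies the step and re-runs. */
  public static class GradientSumReducer
      extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    @Override
    protected void reduce(IntWritable index, Iterable<DoubleWritable> partials, Context context)
        throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable gradient : partials) {
        sum += gradient.get();
      }
      context.write(index, new DoubleWritable(sum));
    }
  }
}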

I'll probably just implement an in-core variant first.

re: online kernelized ranker - this is pretty easy to do so I will probably
do it as a starter contribution.

re: java, sure I have no problems writing it all in java.

What's the process for doing this? Write the code and then start a JIRA ticket
with the patch?

-- 
Yee Yang Li Hector
http://hectorgon.blogspot.com/ (tech + travel)
http://hectorgon.com (book reviews)

Re: Possible contributions

Posted by Ted Dunning <te...@gmail.com>.
On Tue, May 17, 2011 at 5:26 PM, Hector Yee <he...@gmail.com> wrote:

>  I have some proposed contributions and I wonder if they will be useful in
> Mahout (otherwise I will just commit it in a new open source project in
> github).
>

These generally sound pretty good.


> - Sparse autoencoder (think of it as something like LDA - it has an
> unsupervised hidden topic model and an output that reconstructs the input
> but blurs it a bit due to the hidden layer bottleneck). The variant I am
> planning to implement is optimized for sparse (e.g. text) labels. Not sure
> if it will fit into the filter framework?
>

This would definitely fit into the variable encoder framework where the
hashed encoders live.

Filters is another reasonable home.

Clustering is a reasonable home since it has come to mean "unsupervised
stuff" for the most part.


> - Boosting with l1 regularization and back pruning. (just the binary case -
> I haven't had much luck with the multi-class case vs adaboost ECC).
>

How scalable is this?


> - online kernelized learner for ranking and classification (optimization in
> the primal rather than the dual)
>

This would be very interesting.  It would fit in very well next to the SGD
models as an interesting alternative/elaboration.


>
> I'm new to Mahout, so let me know if anyone is working on these already or
> not. I've implemented them several times in C++.
>

These all sound plenty new enough.

You good with doing them in Java?