Posted to dev@mahout.apache.org by David Hall <dl...@cs.berkeley.edu> on 2009/12/03 09:02:19 UTC

Re: SVM algo, code, etc.

On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org> wrote:
> On Fri Grant Ingersoll <gs...@apache.org> wrote:
>> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
>> > Post a patch if you'd like to proceed, IMHO.
>> +1
>
> +1 from me as well. I would love to see solid svm support in Mahout.

And another +1 from me. If you want a pointer, I've recently stumbled
on a new solver for SVMs that seems to be remarkably easy to
implement.

It's called Pegasos:

http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf

-- David

Re: SVM algo, code, etc.

Posted by David Hall <dl...@cs.berkeley.edu>.
On Thu, Dec 3, 2009 at 2:12 AM, Olivier Grisel <ol...@ensta.org> wrote:
> 2009/12/3 Ted Dunning <te...@gmail.com>:
>> Very interesting results, particularly the lack of dependence on data size.
>>
>> On Thu, Dec 3, 2009 at 12:02 AM, David Hall <dl...@cs.berkeley.edu> wrote:
>>
>>> On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org> wrote:
>>> > On Fri Grant Ingersoll <gs...@apache.org> wrote:
>>> >> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
>>> >> > Post a patch if you'd like to proceed, IMHO.
>>> >> +1
>>> >
>>> > +1 from me as well. I would love to see solid svm support in Mahout.
>>>
>>> And another +1 from me. If you want a pointer, I've recently stumbled
>>> on a new solver for SVMs that seems to be remarkably easy to
>>> implement.
>>>
>>> It's called Pegasos:
>>>
>>> http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
>
> Pegasos and other online implementations of SVMs based on regularized
> variants of stochastic gradient descent are indeed amenable to large
> scale problems. They solve the SVM optimization problem with a
> stochastic approximation of the primal (as opposed to more 'classical'
> solvers such as libsvm that solve the dual problem using Sequential
> Minimal Optimization). However, SGD-based SVM implementations are
> currently limited to the linear 'kernel' (which is often expressive
> enough for common NLP tasks such as document categorization).

You seem to know far more about this than I do, but the paper I linked
to specifically says (page 6) that they can use Mercer kernels.
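For reference, the kernelized variant described there keeps a count alpha[i] of how often each training example triggered an update, instead of an explicit weight vector. A minimal sketch, assuming an RBF kernel and hinge loss (my own illustration, not code from the paper or from Mahout):

```python
import math
import random

def rbf(x, z, gamma=5.0):
    """Gaussian (RBF) kernel, a standard Mercer kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_pegasos(data, labels, kernel, lam=0.01, iterations=500, seed=0):
    """Kernelized Pegasos: instead of a weight vector w, keep a count
    alpha[i] of how many times example i violated the margin. The implicit
    weight vector is w_t = (1 / (lam * t)) * sum_j alpha[j] * y_j * phi(x_j)."""
    rng = random.Random(seed)
    n = len(data)
    alpha = [0] * n
    for t in range(1, iterations + 1):
        i = rng.randrange(n)
        # Decision value at x_i under the implicit weight vector
        s = sum(alpha[j] * labels[j] * kernel(data[j], data[i])
                for j in range(n) if alpha[j])
        if labels[i] * s / (lam * t) < 1:  # margin violated -> update
            alpha[i] += 1
    return alpha

def kernel_predict(alpha, data, labels, kernel, x):
    """Sign of the (unnormalized) decision function at x."""
    s = sum(a * y * kernel(xj, x) for a, y, xj in zip(alpha, labels, data))
    return 1 if s >= 0 else -1
```

Note the cost, though: each step may touch all n counts, which is part of why SGD implementations tend to stick to the linear case in practice.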

-- David

Re: SVM algo, code, etc.

Posted by Grant Ingersoll <gs...@gmail.com>.
Hi Zhao,

The best way to share this is to generate a patch and attach it to an issue in JIRA.  See the How To Contribute section on the wiki.  I think many of us are interested, so it would be easier to have the discussion on the mailing list or JIRA instead of trying to manage it privately.

-Grant

On Dec 19, 2009, at 1:50 AM, zhao zhendong wrote:

> Hi all,
> 
> I have finished a draft of a proposal for a parallel Pegasos SVM solver.
> 
> I need some comments. If anyone is interested in this proposal, please
> contact me.
> 
> By the way, would it be a good idea to attach this proposal to this mail
> thread?
> 
> Cheers,
> Zhendong
> 
> On Thu, Dec 17, 2009 at 1:16 AM, Ted Dunning <te...@gmail.com> wrote:
> 
>> I have half of an SGD logistic regression written.  There might be some
>> code
>> to share.  I will put up a patch shortly so that others can follow
>> progress.
>> 
>> There is also a group of four (very quiet) developers who are working on
>> something related to SVM.  I will be meeting with them tomorrow morning and
>> will encourage them to post a JIRA for discussion.
>> 
>> On Tue, Dec 15, 2009 at 10:32 PM, David Hall <dl...@cs.berkeley.edu> wrote:
>> 
>>> I would be
>>> glad to get this up and running, though I'd like more to help curate a
>>> patch. Pegasos or no.
>>> 
>> 
>> 
>> 
>> --
>> Ted Dunning, CTO
>> DeepDyve
>> 


Re: SVM algo, code, etc.

Posted by zhao zhendong <zh...@gmail.com>.
Hi all,

I have finished a draft of a proposal for a parallel Pegasos SVM solver.

I need some comments. If anyone is interested in this proposal, please
contact me.

By the way, would it be a good idea to attach this proposal to this mail
thread?

Cheers,
Zhendong

On Thu, Dec 17, 2009 at 1:16 AM, Ted Dunning <te...@gmail.com> wrote:

> I have half of an SGD logistic regression written.  There might be some
> code
> to share.  I will put up a patch shortly so that others can follow
> progress.
>
> There is also a group of four (very quiet) developers who are working on
> something related to SVM.  I will be meeting with them tomorrow morning and
> will encourage them to post a JIRA for discussion.
>
> On Tue, Dec 15, 2009 at 10:32 PM, David Hall <dl...@cs.berkeley.edu> wrote:
>
> >  I would be
> > glad to get this up and running, though I'd like more to help curate a
> > patch. Pegasos or no.
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>

Re: SVM algo, code, etc.

Posted by Ted Dunning <te...@gmail.com>.
I have half of an SGD logistic regression written.  There might be some code
to share.  I will put up a patch shortly so that others can follow progress.

There is also a group of four (very quiet) developers who are working on
something related to SVM.  I will be meeting with them tomorrow morning and
will encourage them to post a JIRA for discussion.

On Tue, Dec 15, 2009 at 10:32 PM, David Hall <dl...@cs.berkeley.edu> wrote:

>  I would be
> glad to get this up and running, though I'd like more to help curate a
> patch. Pegasos or no.
>



-- 
Ted Dunning, CTO
DeepDyve

Re: SVM algo, code, etc.

Posted by David Hall <dl...@cs.berkeley.edu>.
On Fri, Dec 11, 2009 at 5:02 AM, Jake Mannix <ja...@gmail.com> wrote:
> I really feel like I should respond to this, but seeing as I live on the
> west coast
> of the US, going to bed might be more advisable.
>
> On a very specific topic of SVMs, I can certainly look into this, but David,
> were you interested in helping bring this into Mahout and help maintain it?
> You are often rather quiet on here, yet happened to jump in as this topic
> came up?

Yeah, the first semester of the PhD program has been far busier than I
imagined, and I've been overwhelmed. (Now it's finals week.)

Online optimization has kind of caught my eye of late, with Pegasos
being something I had been thinking about implementing. I would be
glad to get this up and running, though I'd like more to help curate a
patch. Pegasos or no.

-- David

>
>  -jake
>
> On Fri, Dec 11, 2009 at 4:40 AM, Sean Owen <sr...@gmail.com> wrote:
>
>> This is a timely message, since I'm presuming to close some old
>> Mahout issues at the moment, and it raises a related concern.
>>
>> There's lots of old JIRA issues of the form:
>> 1) somebody submits a patch implementing part of something
>> 2) some comments happen, maybe
>> 3) nothing happens for a year
>> 4) I close it now
>>
>> At an early stage, this is fine actually. 20 people contribute at the
>> start; 3 select themselves naturally as regular contributors. 20
>> patches go up; the 5 that are of use and interest naturally get picked
>> up and eventually committed. But going forward, this probably won't
>> do. Potential committers get discouraged and work goes wasted. (See
>> comments about Commons Math on this list for an example of the
>> fallout.)
>>
>> I wonder what the obstacles are to avoiding this?
>>
>> 1) Do we need to be clearer about what the project is and isn't about?
>> What the priorities are, what work is already on the table to be done?
>> This is why I am keen on cleaning up JIRA now; it's hard for even us
>> to understand what's in progress, what's important,
>>
>> 2) Do we need some more official ownership or responsibility for
>> components? For example I am not sure who would manage changes to
>> Clustering stuff. I know it isn't me; I don't know about that part. So
>> what happens to an incoming patch to clustering? While too much
>> command-and-control isn't possible or desirable in open source, lack
>> of it is harmful too. I don't think the answer is "just let people
>> commit bits and bobs" since it makes the project appear to be a
>> workbench of half-finished jobs, which does a disservice to the
>> components that are polished.
>>
>>
>> I have no reason to believe this SVM patch, should it materialize,
>> would fall through the cracks in this way, but want to ask now how we
>> can just make sure. So, can we answer:
>>
>> 1) Is SVM in scope for Mahout? (I am guessing so.)
>> 2) Who is nominally committing to shepherd the code into the code base
>> and fix bugs and answer questions? (Jake?)
>>
>>
>> I'm not really bothered about this particular patch, but the more
>> general question.
>>
>

Re: SVM algo, code, etc.

Posted by Jake Mannix <ja...@gmail.com>.
I really feel like I should respond to this, but seeing as I live on the
west coast
of the US, going to bed might be more advisable.

On a very specific topic of SVMs, I can certainly look into this, but David,
were you interested in helping bring this into Mahout and help maintain it?
You are often rather quiet on here, yet happened to jump in as this topic
came up?

  -jake

On Fri, Dec 11, 2009 at 4:40 AM, Sean Owen <sr...@gmail.com> wrote:

> This is a timely message, since I'm presuming to close some old
> Mahout issues at the moment, and it raises a related concern.
>
> There's lots of old JIRA issues of the form:
> 1) somebody submits a patch implementing part of something
> 2) some comments happen, maybe
> 3) nothing happens for a year
> 4) I close it now
>
> At an early stage, this is fine actually. 20 people contribute at the
> start; 3 select themselves naturally as regular contributors. 20
> patches go up; the 5 that are of use and interest naturally get picked
> up and eventually committed. But going forward, this probably won't
> do. Potential committers get discouraged and work goes wasted. (See
> comments about Commons Math on this list for an example of the
> fallout.)
>
> I wonder what the obstacles are to avoiding this?
>
> 1) Do we need to be clearer about what the project is and isn't about?
> What the priorities are, what work is already on the table to be done?
> This is why I am keen on cleaning up JIRA now; it's hard for even us
> to understand what's in progress, what's important,
>
> 2) Do we need some more official ownership or responsibility for
> components? For example I am not sure who would manage changes to
> Clustering stuff. I know it isn't me; I don't know about that part. So
> what happens to an incoming patch to clustering? While too much
> command-and-control isn't possible or desirable in open source, lack
> of it is harmful too. I don't think the answer is "just let people
> commit bits and bobs" since it makes the project appear to be a
> workbench of half-finished jobs, which does a disservice to the
> components that are polished.
>
>
> I have no reason to believe this SVM patch, should it materialize,
> would fall through the cracks in this way, but want to ask now how we
> can just make sure. So, can we answer:
>
> 1) Is SVM in scope for Mahout? (I am guessing so.)
> 2) Who is nominally committing to shepherd the code into the code base
> and fix bugs and answer questions? (Jake?)
>
>
> I'm not really bothered about this particular patch, but the more
> general question.
>

Re: SVM algo, code, etc.

Posted by Isabel Drost <is...@apache.org>.
On Fri Sean Owen <sr...@gmail.com> wrote:

> Sure, is there a mailing list or something for this? I'd like to be
> looped into talking about issues like this.

dev@community.apache.org

Isabel

Re: SVM algo, code, etc.

Posted by Sean Owen <sr...@gmail.com>.
Sure, is there a mailing list or something for this? I'd like to be
looped into talking about issues like this.

On Fri, Dec 11, 2009 at 3:59 PM, Isabel Drost <is...@apache.org> wrote:
> Attracting new committers and integrating contributors is a topic that
> is not only relevant for Mahout but for many, if not all Apache
> projects. As a result, in November this year the project comdev was
> founded with the goal to discuss these issues, identify solutions that
> work, discuss common problems, establish an Apache mentoring program
> similar but complementary to GSoC, etc. That's why I thought the
> question you raised really might be interesting there as well.
>
> Isabel
>

Re: SVM algo, code, etc.

Posted by Isabel Drost <is...@apache.org>.
On Fri Sean Owen <sr...@gmail.com> wrote:
> On Fri Isabel Drost wrote:
> > If you are interested in a broader discussion, it might make sense
> > to include the people over at the newly founded community
> > development project in the discussion?
> 
> What's this?

Attracting new committers and integrating contributors is a topic that
is not only relevant for Mahout but for many, if not all Apache
projects. As a result, in November this year the project comdev was
founded with the goal to discuss these issues, identify solutions that
work, discuss common problems, establish an Apache mentoring program
similar but complementary to GSoC, etc. That's why I thought the
question you raised really might be interesting there as well.

Isabel

Re: SVM algo, code, etc.

Posted by Sean Owen <sr...@gmail.com>.
On Fri, Dec 11, 2009 at 2:08 PM, Isabel Drost <is...@apache.org> wrote:
> I would guess that your question is not really restricted to Mahout but
> the same question appears on other projects as well. Basically the
> question is on "how to mentor developers to become new project members".

The first question is how to avoid orphaned patches. It's OK that some
stuff was orphaned in the early days, but I want to see that we're set
up to avoid it going forward. Yes, that could be solved by cultivating
patch submitters into committers and de facto owners of that subset of
the project.

More broadly, the issue is making sure someone is at least loosely
watching over each bit of code in the project, so that a proposed idea
either receives support right away, through to committing, or else isn't
begun at all. That way we have code of consistent quality and
maintenance.

I argue that answering the second one actually improves the project's
ability to attract committers in the first place.


> If you are interested in a broader discussion, it might make sense to
> include the people over at the newly founded community development
> project in the discussion?

What's this?

Re: SVM algo, code, etc.

Posted by Isabel Drost <is...@apache.org>.
On Fri Sean Owen <sr...@gmail.com> wrote:

> 1) Is SVM in scope for Mahout? (I am guessing so.)

Yes.


> 2) Who is nominally committing to shepherd the code into the code base
> and fix bugs and answer questions? (Jake?)
> 
> I'm not really bothered about this particular patch, but the more
> general question.

I would guess that your question is not really restricted to Mahout but
the same question appears on other projects as well. Basically the
question is on "how to mentor developers to become new project members".

If you are interested in a broader discussion, it might make sense to
include the people over at the newly founded community development
project in the discussion?

Isabel

Re: SVM algo, code, etc.

Posted by Grant Ingersoll <gs...@apache.org>.
On Dec 11, 2009, at 7:40 AM, Sean Owen wrote:

> This is a timely message, since I'm presuming to close some old
> Mahout issues at the moment, and it raises a related concern.
> 
> There's lots of old JIRA issues of the form:
> 1) somebody submits a patch implementing part of something
> 2) some comments happen, maybe
> 3) nothing happens for a year
> 4) I close it now
> 
> At an early stage, this is fine actually. 20 people contribute at the
> start; 3 select themselves naturally as regular contributors. 20
> patches go up; the 5 that are of use and interest naturally get picked
> up and eventually committed. But going forward, this probably won't
> do. Potential committers get discouraged and work goes wasted. (See
> comments about Commons Math on this list for an example of the
> fallout.)
> 
> I wonder what the obstacles are to avoiding this?

To some extent, it is the nature of open source.  People are all volunteers here, and stuff that is good sticks while stuff that doesn't goes by the wayside.  People also need to realize there is more to contributing than just throwing up code and expecting someone else to do all the work from here on out.

> 
> 1) Do we need to be clearer about what the project is and isn't about?
> What the priorities are, what work is already on the table to be done?
> This is why I am keen on cleaning up JIRA now; it's hard for even us
> to understand what's in progress, what's important,

I think this sorts itself out, and I don't worry too much about it, but feel free to clean up.  People can always reopen later if they want.  That's how a meritocracy works.  If you have an itch to clean it up, then do so.


> 
> 2) Do we need some more official ownership or responsibility for
> components? For example I am not sure who would manage changes to
> Clustering stuff. I know it isn't me; I don't know about that part. So
> what happens to an incoming patch to clustering? While too much
> command-and-control isn't possible or desirable in open source, lack
> of it is harmful too. I don't think the answer is "just let people
> commit bits and bobs" since it makes the project appear to be a
> workbench of half-finished jobs, which does a disservice to the
> components that are polished.

I always look at it as an evolution.  We are all volunteers, so it is impossible to say someone owns anything in Mahout.  Part of being a committer is knowing what not to touch as much as what to touch.  For example, I can change code in the recommenders, but I know it is not my area of expertise, so I don't unless I really understand it.

We are starting to hit a critical mass of committers and contributors now, such that this will just work out through the wisdom of crowds.  The phrase "herding cats" is often used when describing project management at the ASF.  My take is, people should scratch their own itches and, when needed, engage the community for stuff that is beyond their scope.  Bits and bobs is how it works.  This is a feature, not a bug.

Believe it or not, when I first joined Lucene, I had much the same approach as you are proposing.  Let's get some real project management around here and plan out what we are going to do, etc.  The problem is you never know where the next good thing is going to come from, so all you can do is take care of what you control (i.e. your items of interest) and then set aside some part of your time to help with other stuff.  As you become more involved in the project, you will likely grow an interest in other areas.  For example, I knew nothing about recommenders coming into Mahout, but have been learning simply by looking at the code and through osmosis, such that I now feel comfortable doing some things in there.

I think we have enough direction (scalable machine learning implementations) now and enough people interested in a variety of places that I'm not worried about the half-finished stuff anymore.  It is being taken care of by people who use it.  Just look at the Colt stuff.  No one was sure who would start adding tests and cleaning it up, but Benson, Drew and Jake stepped up and are doing it because they have a need for it or simply have the time to do it.  Over time (usually short), if they continue, they will be made committers.

> 
> 
> I have no reason to believe this SVM patch, should it materialize,
> would fall through the cracks in this way, but want to ask now how we
> can just make sure. So, can we answer:
> 
> 1) Is SVM in scope for Mahout? (I am guessing so.)

Of course it is.  SVM is a well known machine learning algorithm.

> 2) Who is nominally committing to shepherd the code into the code base
> and fix bugs and answer questions? (Jake?)

I wouldn't worry about it.  People will select it when it fits either with their available time or their particular need.  The contributor's job is to make sure it is ready to go by having tests, responding to questions, comments, etc.  For instance, I have an interest in SVMs, so I will likely review it at some point.



Re: SVM algo, code, etc.

Posted by Sean Owen <sr...@gmail.com>.
This is a timely message, since I'm presuming to close some old Mahout
issues at the moment, and it raises a related concern.

There's lots of old JIRA issues of the form:
1) somebody submits a patch implementing part of something
2) some comments happen, maybe
3) nothing happens for a year
4) I close it now

At an early stage, this is fine actually. 20 people contribute at the
start; 3 select themselves naturally as regular contributors. 20
patches go up; the 5 that are of use and interest naturally get picked
up and eventually committed. But going forward, this probably won't
do. Potential committers get discouraged and work goes wasted. (See
comments about Commons Math on this list for an example of the
fallout.)

I wonder what the obstacles are to avoiding this?

1) Do we need to be clearer about what the project is and isn't about?
What the priorities are, what work is already on the table to be done?
This is why I am keen on cleaning up JIRA now; it's hard for even us
to understand what's in progress, what's important,

2) Do we need some more official ownership or responsibility for
components? For example I am not sure who would manage changes to
Clustering stuff. I know it isn't me; I don't know about that part. So
what happens to an incoming patch to clustering? While too much
command-and-control isn't possible or desirable in open source, lack
of it is harmful too. I don't think the answer is "just let people
commit bits and bobs" since it makes the project appear to be a
workbench of half-finished jobs, which does a disservice to the
components that are polished.


I have no reason to believe this SVM patch, should it materialize,
would fall through the cracks in this way, but want to ask now how we
can just make sure. So, can we answer:

1) Is SVM in scope for Mahout? (I am guessing so.)
2) Who is nominally committing to shepherd the code into the code base
and fix bugs and answer questions? (Jake?)


I'm not really bothered about this particular patch, but the more
general question.

Re: SVM algo, code, etc.

Posted by Jake Mannix <ja...@gmail.com>.
Hi Zhao,

  I would certainly love to see a nice parallel SVM on hadoop.  Submit a
patch, let's
get it in Mahout!

  -jake

On Fri, Dec 11, 2009 at 3:52 AM, zhao zhendong <zh...@gmail.com> wrote:

> True, I am still wondering whether it is worthwhile to implement a
> parallel SVM on Hadoop? I really want to join Mike's group.
>
> As Olivier noted, some linear SVM solvers can handle large-scale data
> sets (several seconds for samples on the order of 100K). It's true that
> the linear version does not use a Mercer kernel; however, linear methods
> can often obtain accuracy very similar to that of solvers with advanced
> kernels on large-scale data sets. I really don't know whether that is
> true or not.
>
>
>
> On Thu, Dec 3, 2009 at 6:12 PM, Olivier Grisel <olivier.grisel@ensta.org> wrote:
>
> > 2009/12/3 Ted Dunning <te...@gmail.com>:
> > > Very interesting results, particularly the lack of dependence on data
> > size.
> > >
> > > On Thu, Dec 3, 2009 at 12:02 AM, David Hall <dl...@cs.berkeley.edu>
> > wrote:
> > >
> > >> On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org>
> > wrote:
> > >> > On Fri Grant Ingersoll <gs...@apache.org> wrote:
> > >> >> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
> > >> >> > Post a patch if you'd like to proceed, IMHO.
> > >> >> +1
> > >> >
> > >> > +1 from me as well. I would love to see solid svm support in Mahout.
> > >>
> > >> And another +1 from me. If you want a pointer, I've recently stumbled
> > >> on a new solver for SVMs that seems to be remarkably easy to
> > >> implement.
> > >>
> > >> It's called Pegasos:
> > >>
> > >> http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
> >
> > Pegasos and other online implementations of SVMs based on regularized
> > variants of stochastic gradient descent are indeed amenable to large
> > scale problems. They solve the SVM optimization problem with a
> > stochastic approximation of the primal (as opposed to more 'classical'
> > solvers such as libsvm that solve the dual problem using Sequential
> > Minimal Optimization). However, SGD-based SVM implementations are
> > currently limited to the linear 'kernel' (which is often expressive
> > enough for common NLP tasks such as document categorization).
> >
> > Other interesting resources on the topic:
> >
> > A simple reference implementation of Pegasos:
> > - http://leon.bottou.org/projects/sgd
> >
> > Speeding up the convergence of linear SGD-based SVMs using an
> > estimate of the diagonal of the Hessian:
> > - http://webia.lip6.fr/~bordes/mywiki/doku.php?id=sgdqn
> >
> > Using sparsifying L1 priors as a regularizer to perform automated
> > feature selection:
> > - http://www.cs.berkeley.edu/~jduchi/projects/DuchiSi09_folos.html
> >
> > Working coordinate-wise on large-dimensional problems, also using L1
> > priors (maybe easier to make map-reduceable efficiently):
> > - http://ttic.uchicago.edu/~tewari/code/scd/
> >
> > Also do not overlook the highly optimized Vowpal Wabbit, probably the
> > fastest linear classifier on earth:
> > - http://hunch.net/~vw/
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://code.oliviergrisel.name
> >
>
>
>
> --
> -------------------------------------------------------------
>
> Zhen-Dong Zhao (Maxim)
>
> <><<><><><><><><><>><><><><><>>>>>>
>
> Department of Computer Science
> School of Computing
> National University of Singapore
>
> ><><><><><><><><><><><><><><><><<<<
> Homepage:http://zhaozhendong.googlepages.com
> Mail: zhaozhendong@gmail.com
> >>>>>>><><><><><><><><<><>><><<<<<<
>

Re: SVM algo, code, etc.

Posted by zhao zhendong <zh...@gmail.com>.
True, I am still wondering whether it is worthwhile to implement a
parallel SVM on Hadoop? I really want to join Mike's group.

As Olivier noted, some linear SVM solvers can handle large-scale data
sets (several seconds for samples on the order of 100K). It's true that
the linear version does not use a Mercer kernel; however, linear methods
can often obtain accuracy very similar to that of solvers with advanced
kernels on large-scale data sets. I really don't know whether that is
true or not.



On Thu, Dec 3, 2009 at 6:12 PM, Olivier Grisel <ol...@ensta.org> wrote:

> 2009/12/3 Ted Dunning <te...@gmail.com>:
> > Very interesting results, particularly the lack of dependence on data
> size.
> >
> > On Thu, Dec 3, 2009 at 12:02 AM, David Hall <dl...@cs.berkeley.edu>
> wrote:
> >
> >> On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org>
> wrote:
> >> > On Fri Grant Ingersoll <gs...@apache.org> wrote:
> >> >> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
> >> >> > Post a patch if you'd like to proceed, IMHO.
> >> >> +1
> >> >
> >> > +1 from me as well. I would love to see solid svm support in Mahout.
> >>
> >> And another +1 from me. If you want a pointer, I've recently stumbled
> >> on a new solver for SVMs that seems to be remarkably easy to
> >> implement.
> >>
> >> It's called Pegasos:
> >>
> >> http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
>
> Pegasos and other online implementations of SVMs based on regularized
> variants of stochastic gradient descent are indeed amenable to large
> scale problems. They solve the SVM optimization problem with a
> stochastic approximation of the primal (as opposed to more 'classical'
> solvers such as libsvm that solve the dual problem using Sequential
> Minimal Optimization). However, SGD-based SVM implementations are
> currently limited to the linear 'kernel' (which is often expressive
> enough for common NLP tasks such as document categorization).
>
> Other interesting resources on the topic:
>
> A simple reference implementation of Pegasos:
> - http://leon.bottou.org/projects/sgd
>
> Speeding up the convergence of linear SGD-based SVMs using an estimate
> of the diagonal of the Hessian:
> - http://webia.lip6.fr/~bordes/mywiki/doku.php?id=sgdqn
>
> Using sparsifying L1 priors as a regularizer to perform automated
> feature selection:
> - http://www.cs.berkeley.edu/~jduchi/projects/DuchiSi09_folos.html
>
> Working coordinate-wise on large-dimensional problems, also using L1
> priors (maybe easier to make map-reduceable efficiently):
> - http://ttic.uchicago.edu/~tewari/code/scd/
>
> Also do not overlook the highly optimized Vowpal Wabbit, probably the
> fastest linear classifier on earth:
> - http://hunch.net/~vw/
>
> --
> Olivier
> http://twitter.com/ogrisel - http://code.oliviergrisel.name
>



-- 
-------------------------------------------------------------

Zhen-Dong Zhao (Maxim)

<><<><><><><><><><>><><><><><>>>>>>

Department of Computer Science
School of Computing
National University of Singapore

><><><><><><><><><><><><><><><><<<<
Homepage:http://zhaozhendong.googlepages.com
Mail: zhaozhendong@gmail.com
>>>>>>><><><><><><><><<><>><><<<<<<

Re: SVM algo, code, etc.

Posted by Olivier Grisel <ol...@ensta.org>.
2009/12/3 Ted Dunning <te...@gmail.com>:
> Very interesting results, particularly the lack of dependence on data size.
>
> On Thu, Dec 3, 2009 at 12:02 AM, David Hall <dl...@cs.berkeley.edu> wrote:
>
>> On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org> wrote:
>> > On Fri Grant Ingersoll <gs...@apache.org> wrote:
>> >> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
>> >> > Post a patch if you'd like to proceed, IMHO.
>> >> +1
>> >
>> > +1 from me as well. I would love to see solid svm support in Mahout.
>>
>> And another +1 from me. If you want a pointer, I've recently stumbled
>> on a new solver for SVMs that seems to be remarkably easy to
>> implement.
>>
>> It's called Pegasos:
>>
>> http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf

Pegasos and other online implementations of SVMs based on regularized
variants of stochastic gradient descent are indeed amenable to large
scale problems. They solve the SVM optimization problem with a
stochastic approximation of the primal (as opposed to more 'classical'
solvers such as libsvm that solve the dual problem using Sequential
Minimal Optimization). However, SGD-based SVM implementations are
currently limited to the linear 'kernel' (which is often expressive
enough for common NLP tasks such as document categorization).
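The primal update these solvers iterate is simple: shrink the weight vector toward zero for the regularizer, then add the current example if it violates the margin. A minimal linear Pegasos sketch in plain Python, with the 1/(lambda*t) step size and the optional projection step from the paper (illustrative only, not Mahout code):

```python
import math
import random

def pegasos(data, labels, lam=0.01, iterations=2000, seed=0):
    """Linear Pegasos: SGD on the primal SVM objective
    (lam/2)*||w||^2 + mean(hinge loss), with step size 1/(lam*t)."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0])
    for t in range(1, iterations + 1):
        i = rng.randrange(len(data))
        x, y = data[i], labels[i]
        eta = 1.0 / (lam * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Regularization shrinks w; a margin violation adds the example.
        w = [(1.0 - eta * lam) * wj for wj in w]
        if margin < 1:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
        # Optional projection onto the ball of radius 1/sqrt(lam).
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > 1.0 / math.sqrt(lam):
            w = [wj / (norm * math.sqrt(lam)) for wj in w]
    return w

def predict(w, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Each step touches a single example, which is why the runtime for a fixed target accuracy does not grow with the data set size.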

Other interesting resources on the topic:

A simple reference implementation of Pegasos:
- http://leon.bottou.org/projects/sgd

Speeding up the convergence of linear SGD-based SVMs using an estimate
of the diagonal of the Hessian:
- http://webia.lip6.fr/~bordes/mywiki/doku.php?id=sgdqn

Using sparsifying L1 priors as a regularizer to perform automated
feature selection:
- http://www.cs.berkeley.edu/~jduchi/projects/DuchiSi09_folos.html

Working coordinate-wise on large-dimensional problems, also using L1
priors (maybe easier to make map-reduceable efficiently):
- http://ttic.uchicago.edu/~tewari/code/scd/

Also do not overlook the highly optimized Vowpal Wabbit, probably the
fastest linear classifier on earth:
- http://hunch.net/~vw/

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name

Re: SVM algo, code, etc.

Posted by Ted Dunning <te...@gmail.com>.
Very interesting results, particularly the lack of dependence on data size.

On Thu, Dec 3, 2009 at 12:02 AM, David Hall <dl...@cs.berkeley.edu> wrote:

> On Wed, Nov 25, 2009 at 2:35 AM, Isabel Drost <is...@apache.org> wrote:
> > On Fri Grant Ingersoll <gs...@apache.org> wrote:
> >> On Nov 19, 2009, at 1:15 PM, Sean Owen wrote:
> >> > Post a patch if you'd like to proceed, IMHO.
> >> +1
> >
> > +1 from me as well. I would love to see solid svm support in Mahout.
>
> And another +1 from me. If you want a pointer, I've recently stumbled
> on a new solver for SVMs that seems to be remarkably easy to
> implement.
>
> It's called Pegasos:
>
> http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
>
> -- David
>



-- 
Ted Dunning, CTO
DeepDyve