Posted to dev@mahout.apache.org by Marko Novakovic <at...@yahoo.com> on 2008/03/31 20:45:26 UTC

GSOC (SVM algorithm)

I have arranged with Professor Srdjan Stnkovic from my
faculty to mentor me on SVM. Next week I will present
my ideas on how I would parallelize the solution I am
proposing for Google's Summer of Code.
Who could be my mentor from Apache?

Greetings



Re: GSOC (SVM algorithm)

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 31 March 2008, Grant Ingersoll wrote:
> On Mar 31, 2008, at 2:45 PM, Marko Novakovic wrote:
> > I have arranged with Professor Srdjan Stnkovic from my
> > faculty to mentor me on SVM. Next week I will present
> > my ideas on how I would parallelize the solution I am
> > proposing for Google's Summer of Code.

Would you be willing to work on Mahout after GSoC is over? I think having
someone from your faculty provide additional insight is really great, and the
community would give you a warm welcome here :)

Isabel


-- 
Toddlers are the stormtroopers of the Lord of Entropy.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: GSOC (SVM algorithm)

Posted by Marko Novakovic <at...@yahoo.com>.

> For everyone applying for GSOC, not just Marko:
>
> We have a good number of applications and will probably only get 1 or
> 2 students (I'm not even sure 1 is guaranteed, but it is likely), even
> though we have 4 willing mentors, since the ASF is allotted a certain
> number of students for the whole group.

If I understand correctly, only 1 or 2 students will
be chosen for the Mahout project.

> I would encourage everyone to make sure their proposals are as strong
> as they can possibly make them.  This means timelines, bios,
> supporting materials, recommendations, references, etc.  Basic job
> interview stuff, I guess.
>
> I can't speak for the other evaluators, but I know at least part of my
> criteria will be based on the level of detail provided, etc.  In my
> mind, it is not enough to simply say you are going to work on some ML
> algorithm, or, as some have said, claim to implement all 10 algorithms
> in the NIPS paper on the wiki in 3 months, or pick one once the
> project starts.  I'd also make some effort to show you have done your
> background research and include any references you have that
> discuss the problem or would help us understand it better.
>
> Cheers,
> Grant

This is my proposal; please give me feedback.

Implementation of the Support Vector Machine Algorithm
on the Hadoop Platform

Abstract

I have been researching search engine functionality,
such as ranking and presenting relevant pages to users.
I noted that the SVM algorithm is a good solution for
classifying crawled Web pages in search engines.
After reading and working through the article
[Joachims, 2007],
I decided to implement an SVM optimized for processing
text data and relevance feedback.
Because SVM is a very complex algorithm with a lot of
operations,
I chose the map-reduce Hadoop platform.

[Joachims, 2007] T. Joachims, F. Radlinski: "Search
Engines that Learn from Implicit Feedback," IEEE
Computer, August 2007, pp 38

Detailed Description

Dear Google and Apache,

Project: Lucene Mahout

My Idea:

My idea is to implement a model and solution for
retrieving and ranking relevant Web pages according to
the user's recent behavior.
Because search engines have a lot of crawled Web pages,
the machine learning algorithms they use must be
distributed or parallelized if we want to obtain
real-time results and keep the retrieved database
fresh.
I want to implement the Support Vector Machine (SVM)
formulation for optimizing multivariate performance
measures described in [Joachims, 2005]. Furthermore,
I would implement the alternative structural
formulation of the SVM optimization problem for
conventional binary classification with error rate and
ordinal regression described in [Joachims, 2006].
A large parallel cluster is usually not needed for
processing relevance feedback, because there are only
a few training examples in these cases.
Because SVM training cost grows steeply with the size
of the problem (quadratic complexity), I want to apply
this solution to the first 100 pages for each
combination of user and query.
I also chose the SVM algorithm because I see it as a
big challenge for me, and it will be useful for
professors at my college.
I will use my work on this project to write a new
article about deploying SVM algorithm optimization in
search engines.
I have prepared for this project by reading these
articles:
[1] C. Burges, "A Tutorial on Support Vector Machines
for Pattern Recognition," Kluwer Academic Publishers,
Boston, 1998
[2] R.E Fan, P.H Chen, C.J. Lin, "Working Set
Selection Using Second Order Information for Training
Support Vector Machines," Journal of Machine Learning
Research 6 (2005), pp 1889-1918
I have also read the Hadoop documentation and examined
your implementation of the k-Means algorithm on this
platform.
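
The map-reduce parallelization mentioned above could look
roughly like the following. This is only a minimal sketch
under stated assumptions, not the proposed Mahout
implementation: it swaps in a plain linear SVM trained by
batch subgradient descent on the hinge loss instead of the
structural formulation of [Joachims, 2006], and it uses
Python for brevity rather than Java on Hadoop. The names
map_partition, reduce_grads, train and predict, and the
parameters lam, lr and epochs, are hypothetical. Each
mapper sums subgradients over its own partition of the
training examples, and the reducer adds the partial sums.

```python
from functools import reduce


def map_partition(w, partition):
    """Map step: sum hinge-loss subgradients over one data partition."""
    grad = [0.0] * len(w)
    for x, y in partition:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # example violates the margin
            for i, xi in enumerate(x):
                grad[i] -= y * xi
    return grad


def reduce_grads(a, b):
    """Reduce step: add two partial subgradient sums element-wise."""
    return [ai + bi for ai, bi in zip(a, b)]


def train(partitions, dim, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent; lam, lr and epochs are illustrative values."""
    w = [0.0] * dim
    n = sum(len(p) for p in partitions)
    for _ in range(epochs):
        # On a cluster, each partition would be handled by a separate mapper.
        partials = [map_partition(w, p) for p in partitions]
        total = reduce(reduce_grads, partials)
        # Regularized update: shrink w and step against the averaged subgradient.
        w = [wi - lr * (lam * wi + gi / n) for wi, gi in zip(w, total)]
    return w


def predict(w, x):
    """Classify x as +1 or -1 by the sign of the linear score."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
```

Here every example is a pair (features, label) with the
label in {-1, +1}; a constant feature of 1.0 can be
appended to each vector to act as a bias term. Only the
per-partition sums cross the network, so the cost of the
reduce step depends on the number of features, not on the
number of examples.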

Development Methodologies:

- Test Driven Development
- Build and testing: Ant and JUnit
- IDE: Eclipse
- SVN for version control
- Javadoc

About Me:

You can see my resume at
http://atisha34.googlepages.com/.
I also participate in several academic projects at my
college:
- Working on a topic-based search engine called Grain,
which is under construction at my faculty.
- A tutorial about search engines, mentored by
Professor Veljko Milutinovic: "The New Avenues in
Search Engines"
presentation:
http://atisha34.googlepages.com/Searchengines.ppt
abstract:
http://atisha34.googlepages.com/TheNewAvenuesinWebSearch.docx
I expect to publish an article based on this
presentation in IPSI Magazine.
- The other projects in which I participate aren't
related to machine learning or search engines.

My Interests:
- Search Engines
- Software Engineering and Test Driven Development
- Machine Learning
- Database Modeling and OO Design
- ERP and Business Processes

Sincerely Yours,
Marko Novakovic
 
[Joachims, 2006] T. Joachims: "Training Linear SVMs in
Linear Time," Proceedings of the ACM Conference on
Knowledge Discovery and Data Mining (KDD), 2006.
[Joachims, 2005] T. Joachims: "A Support Vector Method
for Multivariate Performance Measures," Proceedings of
the International Conference on Machine Learning
(ICML), 2005.





Re: GSOC (SVM algorithm)

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 31, 2008, at 2:45 PM, Marko Novakovic wrote:

> I have arranged with Professor Srdjan Stnkovic from my
> faculty to mentor me on SVM. Next week I will present
> my ideas on how I would parallelize the solution I am
> proposing for Google's Summer of Code.
> Who could be my mentor from Apache?

See http://mahout.markmail.org/message/43yvfxaux4rp2wce

For everyone applying for GSOC, not just Marko:

We have a good number of applications and will probably only get 1 or
2 students (I'm not even sure 1 is guaranteed, but it is likely), even
though we have 4 willing mentors, since the ASF is allotted a certain
number of students for the whole group.  I would encourage everyone to
make sure their proposals are as strong as they can possibly make
them.  This means timelines, bios, supporting materials,
recommendations, references, etc.  Basic job interview stuff, I guess.

I can't speak for the other evaluators, but I know at least part of my
criteria will be based on the level of detail provided, etc.  In my
mind, it is not enough to simply say you are going to work on some ML
algorithm, or, as some have said, claim to implement all 10 algorithms
in the NIPS paper on the wiki in 3 months, or pick one once the
project starts.  I'd also make some effort to show you have done your
background research and include any references you have that
discuss the problem or would help us understand it better.

Cheers,
Grant