Posted to dev@mahout.apache.org by Hao Zheng <vo...@gmail.com> on 2008/04/01 07:39:28 UTC

another GSoC application

Dear all,

I have submitted my application on Google. It seems that all students
also post their applications here, so I hope it is not too late for me
to post mine. Please give me some suggestions on my proposal,
thanks.

Coincidentally, Yun Jiang, another applicant, and I are in the same lab :-).

Application

Abstract

I have solid background knowledge in Machine Learning. Naive Bayes,
Neural Networks, Logistic Regression, Locally Weighted Linear
Regression, and k-Means are easy for me to implement, while SVM, PCA,
ICA, EM, and GDA may cost me some effort.

For each algorithm, I plan to find an existing stable implementation
for reference first. Secondly I will implement a single-machine
version, and verify the correctness with the reference implementation.
Then I will implement a Map/Reduce version, and verify the correctness
with the reference implementation/the single-machine version. Finally,
I will find some large datasets to benchmark the Map/Reduce version,
and figure out the speedup of it. Of course, I will also write
documentation and unit tests during each step.

I am interested in Open Source development, and I am eager to
participate in an open source project. I have used many open source
software tools for a long time. GSoC is a good opportunity for me to
contribute to the open source community. I want to get started here, and
continue to contribute even after GSoC.

1. Biography

I am a graduate student in the CS department of Shanghai JiaoTong
University, Shanghai, China. I have read through the Mahout dev
mailing list, and I have a rough idea of the progress of Mahout. My
research interests include Social Annotation, Information Retrieval,
Web Mining, Semantic Web, Web 2.0, etc. Statistical Learning and
Machine Learning are fundamental knowledge for me, because I have to
deal with many tasks in data and knowledge management. My resume can
be accessed at http://www.apexlab.org/apex_wiki/hzheng.

Besides my research in the lab, I have taken two closely related courses
on Machine Learning: Machine Learning (textbook: Machine Learning.
Tom Mitchell, McGraw Hill, 1997.
http://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/0070428077),
and Statistical Learning (textbook: The Elements of Statistical
Learning. T Hastie, R Tibshirani, J Friedman, Springer, 2001.
http://www.amazon.com/Elements-Statistical-Learning-T-Hastie/dp/0387952845).
So I believe I have solid background knowledge in Machine Learning. My
plan for the Mahout GSoC project is detailed in Section 2.

Recently, I have become interested in Open Source development, and I
am eager to participate in an open source project. I have used many
open source software tools for a long time. GSoC is a good opportunity
for me to contribute to the open source community. I want to get
started here, and continue to contribute even after GSoC.

2. Plan

2.1. Preparation Phase

About Machine Learning, I believe that Naive Bayes, Neural Networks,
Logistic Regression, Locally Weighted Linear Regression, and k-Means
are easy for me to implement, while SVM, PCA, ICA, EM, and GDA may
cost me much more effort. I notice that there are JIRA issues for
Naive Bayes, k-Means, and EM, while svn trunk only has code for
k-Means. I can help the current committers with these existing
algorithms, and also create new ones.

About Map/Reduce, I have read the Google paper "MapReduce: Simplified
Data Processing on Large Clusters", and the NIPS paper "Map-Reduce for
Machine Learning on Multicore". I have learned the general idea of
Map/Reduce, but I have to admit that I have no hands-on experience
with it. I will learn Hadoop first, and try some trivial use cases on
Hadoop to get familiar with Map/Reduce programming. Once I am familiar
with Hadoop, I expect to have no problem in this respect either.
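
As an illustration of the kind of trivial use case meant here, word
count can be sketched in plain Java as explicit map, shuffle, and
reduce phases. This is only a minimal single-process sketch and does
not use the Hadoop API; the class and method names (WordCountSketch,
map, shuffle, reduce, wordCount) are assumptions made for
illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, single-process illustration of the Map/Reduce model.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every token in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map over every line, then shuffle, then reduce per key.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "the lazy dog", "The fox");
        // Prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(lines));
    }
}
```

On a real cluster the framework would take care of the shuffle and of
distributing the map and reduce calls; here all three phases run in
one process so that the data flow is easy to follow.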

About general programming skills, I have about 4 years of experience
in Java programming. I am proficient in Java, and have taken part in
several large projects.

2.2. Coding Phase

I estimate it will take me about 2 weeks to implement Naive Bayes,
Neural Networks, Logistic Regression, Locally Weighted Linear
Regression, and k-Means; 4 weeks to implement SVM, PCA, ICA, EM, and
GDA. By "implement", I mean the following steps:

a). find an existing stable implementation for reference
b). implement a single-machine version, and verify the correctness
with the reference implementation
c). implement a Map/Reduce version, and verify the correctness with
the reference implementation/the single-machine version
d). find some large datasets to benchmark the Map/Reduce version, and
figure out the speedup of it
* During each step, I will also write documentation/unit tests.

Steps a), b), and c) ensure the correctness of our Map/Reduce
implementation, while step d) measures its performance.
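
To make steps b) and c) concrete, here is a minimal sketch of how one
k-Means update could be checked: a straightforward single-machine
version serves as the reference, and a Map/Reduce-style version
computes per-split partial sums that are then combined. It is plain
Java without Hadoop, and every name in it (KMeansStepCheck,
partialSums, combine, closeEnough) is a hypothetical one chosen for
illustration. The results are compared with a tolerance because the
order of floating-point additions differs between the two versions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical check of a Map/Reduce-style k-Means step against a reference.
public class KMeansStepCheck {

    // Index of the centroid closest to p (squared Euclidean distance).
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = p[j] - centroids[c][j];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    // Step b): single-machine reference for one centroid update.
    static double[][] singleMachineStep(List<double[]> points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int c = nearest(p, centroids);
            counts[c]++;
            for (int j = 0; j < dim; j++) sums[c][j] += p[j];
        }
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                // An empty cluster keeps its old centroid.
                updated[c][j] = counts[c] == 0 ? centroids[c][j] : sums[c][j] / counts[c];
        return updated;
    }

    // Step c), "map" side: per-centroid coordinate sums for one split,
    // with the point count stored in the last slot of each row.
    static double[][] partialSums(List<double[]> split, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim + 1];
        for (double[] p : split) {
            int c = nearest(p, centroids);
            for (int j = 0; j < dim; j++) sums[c][j] += p[j];
            sums[c][dim] += 1;
        }
        return sums;
    }

    // Step c), "reduce" side: add the partial sums and divide by the counts.
    static double[][] combine(List<double[][]> partials, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] total = new double[k][dim + 1];
        for (double[][] part : partials)
            for (int c = 0; c < k; c++)
                for (int j = 0; j <= dim; j++) total[c][j] += part[c][j];
        double[][] updated = new double[k][dim];
        for (int c = 0; c < k; c++)
            for (int j = 0; j < dim; j++)
                updated[c][j] = total[c][dim] == 0 ? centroids[c][j] : total[c][j] / total[c][dim];
        return updated;
    }

    // True if both centroid sets agree within tol in every coordinate.
    static boolean closeEnough(double[][] a, double[][] b, double tol) {
        for (int c = 0; c < a.length; c++)
            for (int j = 0; j < a[c].length; j++)
                if (Math.abs(a[c][j] - b[c][j]) > tol) return false;
        return true;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[]{1.0, 1.0}, new double[]{1.5, 2.0}, new double[]{3.0, 4.0},
            new double[]{5.0, 7.0}, new double[]{3.5, 5.0}, new double[]{4.5, 5.0});
        double[][] centroids = {{1.0, 1.0}, {5.0, 7.0}};

        double[][] reference = singleMachineStep(points, centroids);

        // Two splits stand in for the input of two map tasks.
        List<double[][]> partials = new ArrayList<>();
        partials.add(partialSums(points.subList(0, 3), centroids));
        partials.add(partialSums(points.subList(3, 6), centroids));
        double[][] distributed = combine(partials, centroids);

        System.out.println("match: " + closeEnough(reference, distributed, 1e-9));
    }
}
```

The same comparison could be reused as a unit test for each algorithm,
with the reference implementation from step a) taking the place of
singleMachineStep.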

2.3. Miscellaneous but Non-trivial Aspect

I can also help Mahout with some miscellaneous but non-trivial aspects,
for example input/output format standards, input/output utilities, and
anything else proposed on JIRA. My experience in software engineering
may help with these aspects.

3. Schedule

now - May 26: Preparation Phase. Learn more about Map/Reduce. Consult
mentors on what to start with first. Take part in the discussion on
the dev mailing list.

May 27 - August 11: Coding Phase. In these 11 weeks, I plan to
implement 3-4 algorithms, and also help with some miscellaneous
aspects. Write documentation and unit tests.

August 12 - August 18: Fix remaining minor errors. Complete the documentation.