Posted to dev@mahout.apache.org by Yun Jiang <he...@gmail.com> on 2008/03/30 13:41:20 UTC

GSoC proposal

Hi,
Here is my proposal. Hope you can give me some advice. Thanks a lot!

*Overview*
Among the ten machine learning algorithms discussed by Cheng-Tao Chu et
al.[2], I'm most interested in Logistic Regression (LR). I would like to
implement an LR program on Hadoop that can handle both binary and multiclass
classification. LR is an effective and robust algorithm, as well known as
Naive Bayes; its principle is clear and its formula is simple, so it is
widely used. However, its time complexity is higher than that of many other
algorithms, and we must handle all the numerical errors that can occur when
inverting matrices. If LR is integrated into Mahout, which can take
advantage of MapReduce's speed, many machine learning researchers will be
spared the effort of building a time-consuming program just for a benchmark.

*Project*

Logistic Regression has been widely used in binary classification, and many
brilliant new learning algorithms are built on top of LR. So I will not only
implement the basic version of LR; I also have to design the architecture
carefully so that it can swap its core function flexibly, accept
cost-sensitive data (with a 'cost' feature), and adapt easily to other
frameworks. [2] can serve as a good reference framework.
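To make the connection to [2] concrete, here is a minimal, framework-free
sketch of the "summation form" that lets batch LR fit MapReduce: each map
task computes a partial gradient over its data split, and the reduce step
sums the partials. All class and method names below (Example,
partialGradient, combine) are hypothetical placeholders of mine, not
existing Mahout or Hadoop API.

import java.util.List;

/** Sketch only: batch logistic regression in "summation form" (cf. [2]). */
public class SummationFormSketch {

    /** A labeled example: sparse features as parallel index/value arrays, label in {0,1}. */
    public static class Example {
        int[] indices;
        double[] values;
        int label;
    }

    /** "Map" step: gradient of the log-likelihood over one data split. */
    static double[] partialGradient(List<Example> split, double[] w) {
        double[] g = new double[w.length];
        for (Example e : split) {
            double z = 0.0;
            for (int k = 0; k < e.indices.length; k++) {
                z += w[e.indices[k]] * e.values[k];
            }
            double p = 1.0 / (1.0 + Math.exp(-z));   // predicted P(y = 1 | x)
            double err = e.label - p;                // residual for this example
            for (int k = 0; k < e.indices.length; k++) {
                g[e.indices[k]] += err * e.values[k];
            }
        }
        return g;
    }

    /** "Reduce" step: sum the partial gradients from all splits. */
    static double[] combine(List<double[]> partials, int dim) {
        double[] g = new double[dim];
        for (double[] p : partials) {
            for (int j = 0; j < dim; j++) {
                g[j] += p[j];
            }
        }
        return g;
    }
}

The driver would then apply a step such as w[j] += h * g[j] (or a Newton
step) and resubmit the job until convergence.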



In more detail, my implementation of LR will support the following
features:

   - *basic binary classification*

        This function is essentially the same as the one described in [2]. It
accepts data in an input format like SVMlight's [3], outputs a model fitted
to the training set, and then uses that model to predict the test data.

   - *simplified binary classification*

        In practice, many experiments use LR only as a benchmark and do not
need to compute the Hessian matrix in every iteration. Instead, a small
constant float *h*, treated as a step length, is often substituted for the
Hessian. The reason is that computing the Hessian (which involves several
multiplications of large matrices and is therefore very expensive) is the
most time-consuming part of the whole algorithm. Also, because of precision
problems, the matrix inverse can drift out of control over thousands of
iterations, while a small constant step in those cases is safer and often
more accurate.

         However, how to choose *h* is itself a problem. Usually we estimate
it so that the change to the weight vector is not too large, and then try a
few values of *h* manually. We can of course leave this choice to the user,
that is, expose an interface for the parameter, but we can also estimate a
reasonable range for h automatically in, say, the first five iterations and
pick a value from it. (A sketch of this constant-step iteration, together
with the cost weighting and automatic stop described below, appears after
this feature list.)

   - *Automatic stop*

        Like the previous feature, this is for the user's convenience. When
the change of the weight vector is small enough, the iteration should stop.

   - *Cost-sensitive data*

         In some cases we do not want to treat every training sample equally:
a more important sample is assigned a larger weight and thus has a bigger
influence on the predictor. This function is easy to implement, but it plays
an important role when LR is integrated into more advanced frameworks.

   - *Multiclass case*

         The principle of multiclass classification is similar to that of
binary LR, but it has more applications in the real world. Its time
complexity is roughly K^3 times that of the binary case, where K is the
number of classes.
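To illustrate the simplified binary case, the cost-sensitive weighting and
the automatic stop together, here is a minimal single-machine sketch of the
training loop with a constant step length h. It is only a sketch of how the
pieces could fit; the class and field names are hypothetical, not a proposed
Mahout API.

import java.util.List;

/** Sketch only: gradient ascent for binary LR with a constant step length h,
 *  optional per-example cost weights, and a simple automatic stopping rule. */
public class SimpleLogisticSketch {

    /** Dense example for brevity; label in {0,1}, cost defaults to 1.0. */
    public static class Example {
        double[] x;
        int label;
        double cost = 1.0;
    }

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Trains w by gradient ascent; stops when the update is tiny or maxIter is reached. */
    static double[] train(List<Example> data, int dim, double h, double tol, int maxIter) {
        double[] w = new double[dim];
        for (int iter = 0; iter < maxIter; iter++) {
            double[] g = new double[dim];
            for (Example e : data) {
                double z = 0.0;
                for (int j = 0; j < dim; j++) {
                    z += w[j] * e.x[j];
                }
                double err = e.label - sigmoid(z);
                for (int j = 0; j < dim; j++) {
                    // Cost-sensitive: each example's contribution is scaled by its weight.
                    g[j] += e.cost * err * e.x[j];
                }
            }
            double change = 0.0;
            for (int j = 0; j < dim; j++) {
                double delta = h * g[j];      // constant step in place of the Newton/Hessian step
                w[j] += delta;
                change += delta * delta;
            }
            if (Math.sqrt(change) < tol) {    // automatic stop: the weight vector barely moved
                break;
            }
        }
        return w;
    }
}

Estimating h automatically could reuse this same loop: during the first few
iterations, shrink h whenever the update norm grows instead of shrinking.
That heuristic is my own assumption, not something prescribed by [2].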



All the features listed above belong, in my opinion, to 'traditional' LR. As
far as I know, LR also has applications in the following settings:

   - *Discriminative Learning*

         In [7], the authors propose a novel algorithm that uses both
training and test data to build a better predictor. Their idea is to use
unlabeled test data to correct the difference between the training and test
distributions. Training and test data following different distributions is
quite common, and it is caused by sample selection bias [6]. Especially in
cross-domain experiments, discriminative learning of this kind can improve
LR's performance noticeably, so I would like to add this function to LR.

   -  *Positive Only Learning*

         In this setting, the training data consists of labeled data that is
all positive examples plus a body of unlabeled data. This situation arises
when obtaining comprehensive negative training data is too expensive. For
example, if users want to pick out one kind of web page, say 'homepages',
from all web pages, they may simply submit some homepages as positive
training data, leaving us with only positive labeled data and a large amount
of unlabeled data. POL is also very useful for filtering spam e-mail:
because the definition of 'spam' is subjective, every spam mail reported by
a user can be treated as a positive example and the mails of all users as
unlabeled data, so we need not collect exact ham mails for each user or
worry that one person's ham is another person's spam.

         There are effective ways to solve this problem, such as [8]. I can
use a similar algorithm to extract key negative features and then use them
to classify; a rough sketch of one such two-step heuristic follows below.
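To make the positive-only idea more concrete, below is a rough sketch of one
common two-step heuristic: treat the unlabeled examples as tentative
negatives, keep only those the first model scores as confidently negative,
and then retrain. This is my own simplification, not necessarily the exact
PEBL algorithm of [8]; the Scorer and Trainer interfaces are hypothetical.

import java.util.ArrayList;
import java.util.List;

/** Sketch only: a two-step heuristic for learning from positive and unlabeled data. */
public class PositiveOnlySketch {

    /** Higher score means "more likely positive". */
    public interface Scorer {
        double score(double[] features);
    }

    /** Any binary trainer, e.g. the LR loop sketched earlier. */
    public interface Trainer {
        Scorer train(List<double[]> positives, List<double[]> negatives);
    }

    static Scorer trainPositiveOnly(List<double[]> positives, List<double[]> unlabeled,
                                    Trainer trainer, double threshold) {
        // Step 1: pretend every unlabeled example is negative and fit a first model.
        Scorer first = trainer.train(positives, unlabeled);

        // Step 2: keep only the unlabeled examples the first model scores as confidently negative.
        List<double[]> reliableNegatives = new ArrayList<double[]>();
        for (double[] u : unlabeled) {
            if (first.score(u) < threshold) {
                reliableNegatives.add(u);
            }
        }

        // Retrain on the positives versus the reliable negatives only.
        return trainer.train(positives, reliableNegatives);
    }
}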



These two extensions to LR are important, but they are not the only possible
ones. I hope my implementation of LR can give users as many conveniences and
options as possible.

In addition, I think that during this project I can help with some related work:

   - *Matrix calculation*

         LR needs a lot of matrix calculation. From Mahout's JIRA I see that
some matrix-related work has already begun, and I would be very glad to
participate in this area.

         LR mainly uses two matrix operations: multiplication and inversion.
For multiplication, a simple *divide and conquer* algorithm (Strassen's)
reduces the time complexity from O(n^3) to about O(n^2.81) when both
matrices are n*n [4]. Although it is not the fastest known method, it is
much easier to implement than the asymptotically faster ones. As for
inversion, precision is crucial: I will basically use LU decomposition,
which is widely used in mathematical software, and for details such as
handling division by zero I will consult others.

   - *Generalized linear model*

         Logistic regression is one of a class of models known as
generalized linear models. When I described my idea of implementing LR to
Mahout members, they suggested that I design the architecture carefully so
that various kinds of generalized linear regression, such as Poisson
regression, can be implemented in one step. So, besides LR, I am willing to
implement some other regression algorithms of this family; a rough sketch of
one possible pluggable design follows below.
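As a rough sketch of that architecture, the link function could be made
pluggable so that logistic and Poisson regression share one gradient-based
training step. The interface and class names below are hypothetical, not
existing Mahout classes.

/** Sketch only: a pluggable link for generalized linear models. */
public class GlmSketch {

    /** Maps the linear predictor z = w.x to the expected response. */
    public interface Link {
        double inverse(double z);
    }

    /** Logistic regression: the inverse of the logit link is the sigmoid. */
    public static class LogitLink implements Link {
        public double inverse(double z) {
            return 1.0 / (1.0 + Math.exp(-z));
        }
    }

    /** Poisson regression with the canonical log link: the inverse link is exp. */
    public static class LogLink implements Link {
        public double inverse(double z) {
            return Math.exp(z);
        }
    }

    /** One gradient step shared by both models; for canonical links the
     *  per-example gradient contribution is (y - mean) * x. */
    static void gradientStep(double[] w, double[] x, double y, Link link, double h) {
        double z = 0.0;
        for (int j = 0; j < w.length; j++) {
            z += w[j] * x[j];
        }
        double mean = link.inverse(z);
        for (int j = 0; j < w.length; j++) {
            w[j] += h * (y - mean) * x[j];
        }
    }
}

With the canonical links shown here the update rule is identical for both
models, which is why a single training loop can serve logistic, Poisson and
other members of the family.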



*Functional Milestones*

The LR classifier will be able to do basic binary classification as
described in [2].

The LR classifier will be able to do multiclass classification.

The LR classifier will be able to stop automatically and estimate *h*.

Implement Poisson and some other generalized linear models.

Implement Discriminative Learning.

Implement Positive Only Learning.



*Biography*

I'm an undergraduate student at Shanghai Jiao Tong University. My major is
computer science, and I have taken most of the key courses, such as
Probability, Algorithms, and UML & Design Patterns. My programming ability is
quite good: in 2005 and 2006 I participated in ACM/ICPC and won the
championship at the Korea and Vietnam sites respectively. Now I'm a research
member of the Apex Lab at SJTU, and I'm interested in machine learning,
especially classification[9].

Though I'm not yet familiar with open source software development, I'll try
my best to finish this job.


 *References*

[1] http://www.stat.columbia.edu/~gelman/standardize/

[2] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, and A. Y. Ng.
Map-Reduce for Machine Learning on Multicore. In *Advances in Neural
Information Processing Systems*.

[3] http://svmlight.joachims.org/

[4] S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani. *Algorithms*,
pp. 66-67.

[5] J. J. Heckman. Sample selection bias as a specification error.
*Econometrica*, 47:153-161, 1979.

[6] B. Zadrozny. Learning and evaluating classifiers under sample selection
bias. In *Proceedings of the Twenty-First International Conference on
Machine Learning*, pages 114-121, 2004.

[7] S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for
differing training and test distributions. In *Proceedings of the 24th
International Conference on Machine Learning*, 2007.

[8] H. Yu, J. Han, and K. C.-C. Chang. PEBL: positive example based learning
for Web page classification using SVM. In *Proceedings of the Eighth ACM
SIGKDD*, 2002.

[9] X. Lin, W. Y. Dai, Y. Jiang, and Y. Qiang. Classifying Chinese Web Pages
based on English Training Data. In *Proceedings of the 17th International
World Wide Web Conference*, 2008.


Yun Jiang.

Re: GSoC proposal

Posted by Yun Jiang <he...@gmail.com>.
Thanks.
If I can't finish the whole project during the summer (though I'll certainly
try to), I'll manage to finish it after GSoC.

On Mon, Mar 31, 2008 at 4:20 AM, Isabel Drost <ap...@isabel-drost.de>
wrote:

> On Sunday 30 March 2008, Ted Dunning wrote:
> > This is an excellent proposal.  It might be a little bit ambitious for a
> > summer, but it is nicely separated so that partial success will stand
> > alone.
>
> +1
>
> --
> They are called computers simply because computation is the only
> significant
> job that has so far been given to them.
>  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
>  /,`.-'`'    -.  ;-;;,_
>  |,4-  ) )-,_..;\ (  `'-'
> '---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>
>

Re: GSoC proposal

Posted by Isabel Drost <ap...@isabel-drost.de>.
On Sunday 30 March 2008, Ted Dunning wrote:
> This is an excellent proposal.  It might be a little bit ambitious for a
> summer, but it is nicely separated so that partial success will stand
> alone.

+1

-- 
They are called computers simply because computation is the only significant 
job that has so far been given to them.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: GSoC proposal

Posted by Ted Dunning <td...@veoh.com>.
This is an excellent proposal.  It might be a little bit ambitious for a
summer, but it is nicely separated so that partial success will stand alone.

I would be happy to help mentor on this, as I expect would most of the
Mahout community.


