Posted to dev@mahout.apache.org by Hao Zheng <vo...@gmail.com> on 2008/03/20 17:15:22 UTC

application of GSoC

Hi all Mahout devs,

I am interested in your GSoC idea, mahout-machine-learning.

I am a graduate student at SJTU, Shanghai, China. My research
interests include Social Annotation, Information Retrieval, Web
Mining, Semantic Web, Web 2.0, etc. Statistical learning and machine
learning are fundamental knowledge for me. I have taken two courses:

Machine Learning (textbook: Machine Learning, Tom Mitchell, McGraw Hill, 1997.
http://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/0070428077)

and Statistical Learning (textbook: The Elements of Statistical Learning,
T. Hastie, R. Tibshirani, J. Friedman, Springer, 2001).

I have read the incubator proposal, and I believe that Naive Bayes,
Neural Networks, Logistic Regression, Locally Weighted Linear
Regression, and k-Means would be easy for me to implement as
single-machine programs. I have also studied SVM, PCA, ICA, EM, and GDA,
but I am not sure whether I could implement them easily, because of the
advanced mathematics behind them. Do you require candidates to
implement all the algorithms mentioned above? I would really like to
give it a try.

Another reason for my application is that I am interested in open
source development. I am experienced and proficient in Java. I have
used many Apache products (Ant, Commons, Tomcat, Log4j, Lucene, etc.),
but I am new to open source development, so maybe this is a good
chance for me to move from consumer to producer.

Re: application of GSoC

Posted by Hao Zheng <vo...@gmail.com>.

OK, I see. Thank you all.

Re: Re: application of GSoC

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Friday 21 March 2008, Grant Ingersoll wrote:
> I also agree, though, that we don't want to be in that business too much, at
> least not yet.  As we grow and attract more people/contributions, it may
> just happen.

Nevertheless, I think it would be nice to store the models in a format
that records not only the parameters of the algorithms used for training
but also the steps taken for data preprocessing. In many real-world settings
I have seen that data preprocessing strongly affects the quality (e.g. of
classification) that can be achieved. This also relates to the Mahout-18 JIRA issue.
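
As a rough illustration of that idea, here is a minimal sketch of a model record that keeps the ordered preprocessing steps next to the trained parameters. Everything in it is an assumption made for illustration: the class name StoredModel, the step names, and the line-based text format are hypothetical and are not taken from Mahout or from the JIRA issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a model that remembers how its input data was preprocessed. */
class StoredModel {
    private final String algorithm;
    private final List<String> preprocessingSteps; // applied in this order
    private final double[] parameters;

    StoredModel(String algorithm, List<String> preprocessingSteps, double[] parameters) {
        this.algorithm = algorithm;
        this.preprocessingSteps = new ArrayList<>(preprocessingSteps);
        this.parameters = parameters.clone();
    }

    String algorithm() { return algorithm; }
    List<String> preprocessingSteps() { return preprocessingSteps; }
    double[] parameters() { return parameters.clone(); }

    /**
     * Writes one "key=value" line per field. Assumes that step names contain
     * neither '|' nor line breaks; a real format would need proper escaping.
     */
    String serialize() {
        StringBuilder params = new StringBuilder();
        for (int i = 0; i < parameters.length; i++) {
            if (i > 0) params.append(',');
            params.append(parameters[i]);
        }
        return "algorithm=" + algorithm + "\n"
                + "preprocessing=" + String.join("|", preprocessingSteps) + "\n"
                + "parameters=" + params + "\n";
    }

    /** Reads back the text produced by serialize(). */
    static StoredModel parse(String text) {
        String algorithm = "";
        List<String> steps = new ArrayList<>();
        double[] params = new double[0];
        for (String line : text.split("\n")) {
            int eq = line.indexOf('=');
            if (eq < 0) continue; // skip lines that are not key=value
            String key = line.substring(0, eq);
            String value = line.substring(eq + 1);
            if (key.equals("algorithm")) {
                algorithm = value;
            } else if (key.equals("preprocessing") && !value.isEmpty()) {
                steps.addAll(Arrays.asList(value.split("\\|")));
            } else if (key.equals("parameters") && !value.isEmpty()) {
                String[] parts = value.split(",");
                params = new double[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    params[i] = Double.parseDouble(parts[i]);
                }
            }
        }
        return new StoredModel(algorithm, steps, params);
    }
}
```

With such a record, whoever loads the model can see which transformations the input must go through before the parameters are meaningful.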

Isabel

Re: Re: application of GSoC

Posted by Grant Ingersoll <gs...@apache.org>.

I imagine that we may have some utilities that help with some
of these common tasks, but I don't know what they are at the moment,
since we don't have that much code just yet.  I also agree, though,
that we don't want to be in that business too much, at least not yet.  As
we grow and attract more people/contributions, it may just happen.


Re: Re: application of GSoC

Posted by Hao Zheng <vo...@gmail.com>.

To Shunkai:
I agree. It will keep the project dedicated to ML itself, rather
than to the many tricks of feature preparation/extraction/selection.

To Paul:
Combining features into other features involves many tricks, and there
are no standards or criteria to follow. Different users may have
different requirements. I think this would make Mahout more
complicated. Maybe leaving data preparation to the users themselves is a
better choice, so that Mahout only ever processes numbers.

Re: application of GSoC

Posted by "shunkai.fu" <sh...@roboo.com>.

I think what you are discussing is data transformation or data
preparation. It is not necessarily part of the Mahout project;
alternatively, applicants can implement their own InputFormat.

Best,

Shunkai 


Re: application of GSoC

Posted by Paul Elschot <pa...@xs4all.nl>.

For text applications, I think we can safely leave the basic feature
(i.e. term) extraction to Lucene; see the
org.apache.lucene.analysis package. Likewise, term vectors
in Lucene are pretty close to general feature vectors.

Combining features into other features is another thing.
For text this boils down to making queries that combine basic terms.
There are quite a few opportunities for parallelism there, so
I'd like this to be part of Mahout.

I wouldn't know whether a similar breakdown into feature extraction
and feature combination also applies to image recognition.
Is there (more or less) general purpose software available for
basic feature extraction from images?

Regards,
Paul Elschot



Re: application of GSoC

Posted by Ted Dunning <td...@veoh.com>.

Yes.

This is what I was saying.

The decision about what is good to have in Mahout has most to do with which
machine-learning-related tasks need parallelism, not with whether a task is
"learning", "feature selection", or "feature extraction".

That said, the Mahout project also needs reference implementations of all of
these algorithms in sequential form for testing.
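
To make the idea of a sequential reference implementation concrete, here is a minimal sketch of k-means (Lloyd's algorithm) on a single machine. It is not Mahout code: the class name SequentialKMeans is hypothetical, and seeding the centroids with the first k points is a simplifying assumption chosen so that the result is deterministic and easy to compare against a parallel version.

```java
import java.util.Arrays;

/** Minimal sequential k-means (Lloyd's algorithm); an illustrative sketch only. */
class SequentialKMeans {

    /** Squared Euclidean distance between two vectors of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point (lowest index wins ties). */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Alternates assignment and centroid update until nothing changes or
     * maxIterations is reached. Assumes points.length >= k and uses the
     * first k points as the initial centroids.
     */
    static double[][] cluster(double[][] points, int k, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: add each point to the sum of its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each non-empty cluster's centroid to the mean.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its centroid
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    if (updated != centroids[c][i]) changed = true;
                    centroids[c][i] = updated;
                }
            }
            if (!changed) break;
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        System.out.println(Arrays.deepToString(cluster(points, 2, 100)));
    }
}
```

A parallel implementation could then be tested by running both versions on the same small data set with the same initial centroids and comparing the resulting centroids.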


Re: application of GSoC

Posted by Hao Zheng <vo...@gmail.com>.

I understand. Actually, I meant something else. Maybe "feature
selection" is not precise, so let me restate my question.

Generally, whether for image recognition or text classification, we
have to turn the original material into a feature vector. This step is
called "feature extraction" or something like that. My question is: will this
step be part of the Mahout project? If yes, we have to care about the
transformation step; if not, all we need to process are the numbers,
which will make things easier.
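
For the text case, the transformation step in question can be sketched as a simple bag-of-words extraction that turns each document into a vector of term counts. This is only an illustration under stated assumptions: the class BagOfWords and its methods are hypothetical, the tokenizer handles only the letters a-z, and nothing here comes from Mahout or Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: turn raw text into term-count feature vectors. */
class BagOfWords {

    /** Lower-cases the text and splits it on runs of characters other than a-z. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Gives each distinct term an index, in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<String> documents) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String document : documents) {
            for (String term : tokenize(document)) {
                vocabulary.putIfAbsent(term, vocabulary.size());
            }
        }
        return vocabulary;
    }

    /** Counts the known terms of one document; terms outside the vocabulary are ignored. */
    static double[] toVector(String document, Map<String, Integer> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (String term : tokenize(document)) {
            Integer index = vocabulary.get(term);
            if (index != null) {
                vector[index] += 1.0;
            }
        }
        return vector;
    }
}
```

If this step stays outside the project, a learning algorithm only ever sees the double[] vectors and never the original documents.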


Re: application of GSoC

Posted by "shunkai.fu" <sh...@roboo.com>.

A large number of features is one challenge.

Another is the emerging application of stream mining.

Beyond these two challenges, Mahout may be well suited to ensemble
learning, where parallel computing can save a great deal of time.

Shunkai 


Re: application of GSoC

Posted by Ted Dunning <td...@veoh.com>.

I think a better description is that this project is about ML algorithms
that need large scale.

If you have very inexpensive feature selection that can run sequentially,
then it probably isn't important to use Hadoop/Mahout for that.  Some forms
of feature extraction are very expensive, however, and could definitely
benefit from parallelism.  For instance, you could imagine that the feature
extraction step involves a large-scale non-deterministic clustering.  It
might even be that the feature extraction requires parallel processing,
but the actual learning algorithm does not.


Re: application of GSoC

Posted by Hao Zheng <vo...@gmail.com>.

Thanks, Grant. I will learn more about MapReduce over the next few days, and I
will read through all the mails on the dev mailing list. I will surely
submit a proposal, and I hope I will be the lucky dog.

Another question: is this project all about the ML algorithms themselves?
Will we deal only with feature vectors/matrices that are already constructed?
That is, will the project leave out the feature selection part of ML,
e.g. extracting feature vectors from a document collection?


Re: application of GSoC

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 20, 2008, at 12:15 PM, Hao Zheng wrote:

> Hi all Mahout devs,
>
> I am interested in your GSoC idea, mahout-machine-learning.
>
> I am a graduate student at SJTU, Shanghai, China. My research
> interests include Social Annotation, Information Retrieval, Web
> Mining, Semantic Web, Web 2.0, etc. Statistical learning and machine
> learning are fundamental knowledge for me. I have taken two courses:
>
> Machine Learning (textbook: Machine Learning, Tom Mitchell, McGraw Hill, 1997.
> http://www.amazon.com/Machine-Learning-Tom-M-Mitchell/dp/0070428077)
>
> and Statistical Learning (textbook: The Elements of Statistical Learning,
> T. Hastie, R. Tibshirani, J. Friedman, Springer, 2001).
>
> I have read the incubator proposal, and I believe that Naive Bayes,
> Neural Networks, Logistic Regression, Locally Weighted Linear
> Regression, and k-Means would be easy for me to implement as
> single-machine programs. I have also studied SVM, PCA, ICA, EM, and GDA,
> but I am not sure whether I could implement them easily, because of the
> advanced mathematics behind them. Do you require candidates to
> implement all the algorithms mentioned above? I would really like to
> give it a try.
>

I don't think you need to implement them all.  I'd say pick one or
more that you find interesting, unless you think you can do all of
them on a M/R framework in that period of time.  Also feel free to
suggest a different ML algorithm.

The main thing is to pick what you think you can do in that time
frame and make a proposal.  I think we are all a bit new to
GSoC here, so we'll discover as we go.

At any rate, your background sounds appealing, so please do submit a
proposal.

-Grant