Posted to dev@mahout.apache.org by Samee Zahur <sa...@gmail.com> on 2008/03/29 06:43:25 UTC

Undergrad stud interested in GSoC

Hello,
I have read through the NIPS paper and the March archive for this
mailing list, and I feel I can implement some of the algorithms (as
permitted by time) described in the NIPS paper. Being an undergrad
student interested in the field of data-intensive machine learning
techniques and applications, I am interested in implementing these
algorithms as a way of getting exposure to this field.

Even though I have already applied to work in this SoC, I do have one
question. When coding, how am I expected to test the
effectiveness of my algorithms without running them on a multicore
platform? Or do I simply assume that a sufficiently sensible
application of M/R will allow Hadoop to take care of scalability? What
is the usual development platform used here? Sorry if such questions
seem a bit silly, but it is in order to gain experience in such
development that I want to work on this project.

And about the application for SoC, I selected the ASF as the mentoring
organisation - how do I make sure that someone from mahout reviews it?

Initially for the SoC, I want to implement LWLR, NN and PCA, but later
beyond the GSoC I want to continue contributing to this project in
other ways I figure out once I gain familiarity with the scope of this
project.

Samee

Re: Undergrad stud interested in GSoC

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Friday 04 April 2008, Samee Zahur wrote:
> I do see others at least including signature.asc as an attachment at times.

I would guess there is some upper limit on the size of attachments. Don't know 
for sure though.

Isabel

-- 
You cannot use your friends and have them too.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Undergrad stud interested in GSoC

Posted by Samee Zahur <sa...@gmail.com>.

I do see others at least including signature.asc as an attachment at times.


On 4/4/08, Grant Ingersoll <gs...@apache.org> wrote:
>
>  On Apr 4, 2008, at 5:21 AM, Samee Zahur wrote:
>
> >
> > For future reference, just one question: what kind of attachments do
> > go through the mailing list here?
> >
>
>
>  None, I think
>

Re: Undergrad stud interested in GSoC

Posted by Grant Ingersoll <gs...@apache.org>.

On Apr 4, 2008, at 5:21 AM, Samee Zahur wrote:
>
> For future reference, just one question: what kind of attachments do
> go through the mailing list here?


None, I think

Re: Undergrad stud interested in GSoC

Posted by Samee Zahur <sa...@gmail.com>.

Will do. As soon as I write up the unit tests I will submit it. So far
I was testing it by manually invoking it from the command line just to
avoid having to download JUnit :p

For future reference, just one question: what kind of attachments do
go through the mailing list here?

Samee

On 4/3/08, Grant Ingersoll <gs...@apache.org> wrote:
> Please post the code as a patch in JIRA.  See the Wiki "How to Contribute"
> section.
>
>  -Grant
>
>

Re: Undergrad stud interested in GSoC

Posted by Grant Ingersoll <gs...@apache.org>.

Please post the code as a patch in JIRA.  See the Wiki "How to  
Contribute" section.

-Grant

On Apr 3, 2008, at 1:19 AM, Samee Zahur wrote:

> Please rename the attachment to LWLR.tar.gz for opening. Couldn't get
> it mailed otherwise.

Re: Undergrad stud interested in GSoC

Posted by Samee Zahur <sa...@gmail.com>.

Please rename the attachment to LWLR.tar.gz for opening. Couldn't get
it mailed otherwise.

Re: Undergrad stud interested in GSoC

Posted by Samee Zahur <sa...@gmail.com>.

I have updated my GSoC proposal (see below) to include a timeline
and expanded its scope to include ALL the algorithms in the NIPS
paper.

Frankly, I was a little confused when Isabel Drost said that's a lot
of work for one summer. So I spent the last week on background study,
went through the API and the existing code in Mahout, and I think I
understand them well enough to write similar code myself, and fast
enough to implement all the algorithms described in the NIPS paper
this summer. To convince myself, I even built an LWLR implementation
on Hadoop from scratch. It took me around 3-5 hours of coding and
testing.

I know it is only a draft implementation, and that it has lots of room
for improvement, but all that is doable.

Please, if you still think I have grossly underestimated the
difficulty in some way, let me know exactly which parts are the most
difficult and time-consuming, and I will adjust my position
accordingly.

My implementation of LWLR is attached here. I ran it along with the
trunk code from svn. It is just the implementation; the test modules
are not there yet. But I think it is still simple enough for you to
inspect and comment on.


Samee

---- Proposal starts ---------

Introduction

I am Samee Zahur, a senior-year student expecting a BS degree in
Computer Engineering in less than a year. I was very excited to learn
that you would mentor students from the GSoC program as I feel that
the Apache Mahout project would provide me with an excellent
first-hand introduction into working with large-scale machine-learning
systems. Given that I have been involved in various forms of
programming tasks for over 10 years now, I strongly feel that I can
provide you with all the services that you require for this particular
task, and that within the short time-frame.


Scope of Work

During the Google Summer of Code 2008, I intend to implement all of
the ten machine-learning algorithms described in the NIPS paper (or
fewer if specified by the mentor), following the proper project
standards as dictated by the mentor, prepare illustratively large
sample cases for demonstration of each of those algorithms, and
document the API thus coded, once again, according to project
standards.


Expected phases of my work

1) For the first two weeks I will just be reading documentation, doing
background study, going through the existing codebase, discussing
project conventions (how much to comment the code, data I/O format
design, where to use Float[] instead of mahout.util classes etc.), and
getting to know the project in general. This is also the time I will
be using to gain greater familiarity with the Hadoop framework.

2) At this point, I will start coding. During this time I will code
the LWLR, NN and PCA algorithms. I expect to take a maximum
of 2-3 days for a draft working system for each of these algorithms,
along with one more week for fine-tuning, bringing things in line with
project conventions, documenting, testing etc. All these activities
will be reported to the mahout-dev mailing list, so that mentors may
check progress and correct/adjust my activity accordingly.

3) Once these three algorithms have been completely implemented, I
expect to be quite thoroughly familiar with the project - codebase and
conventions. At this time I should be able to implement any and all of
the remaining algorithms without much trouble, once again, taking 2-3
days for each. At this point, I expect my progress to be slowed down
by the fact that I may start contributing to this project in ways
outside the scope of the GSoC. Nevertheless, I expect this phase to be
over within a maximum of 5 weeks.

4) By this time I should have delivered almost all my work related to
the GSoC, and explained it well to the mentor. Only demonstration
and test data and usage tutorial writeups are expected to remain
at this point. Once again, I expect this to take around 3-5 weeks,
depending on the actual size of the API I have implemented thus far.

Personal Background

As you may realize, I have been involved in programming ever since
middle school, and since then I have used my skills sometimes for
semi-professional needs, sometimes purely as a hobby. From the
descriptions of my experience and interests below, you will realize
that I am usually a quick learner. And it is this particular quality
which I feel will make me the most suitable candidate for your work
here. I realize that any student taking up this task would be required
to gain quick familiarity with Hadoop and its implementation of
MapReduce, Mahout design plans and planned interfaces, and an in-depth
understanding of principles relating to scalability. You will find
that in the descriptions below, I have repeatedly proven myself to be
capable of rapidly acquiring new skills in this area and using them
within a tight time constraint.


Relevant Activities

I have been an active contestant in various algorithm-based
programming competitions for the last 2-3 years, both on national and
international levels. Competitions I have participated in include ACM
ICPC Regional Contests, TopCoder Collegiate Challenge and TopCoder
Open, and various other nationwide contests. You can see a graphical
representation of my performance over the years here:

http://www.topcoder.com/tc?module=MemberProfile&cr=16039001

This profile reflects my performance in programming competitions,
mainly the results of the weekly Single Round Matches I have
participated in on TopCoder to keep my skills sharp. As the graph
shows, at the start I was not quite adept at such competitions. But
with practice, my performance improved. The tasks required us to read
problem statements, analyze them, identify a feasible solution
approach and code an algorithm which solves the given problem. With
practice, I was able to go from opening the problem statement to
submitting the solution in under 10 minutes for moderately easy
problems. These 10 minutes included the time to read the problem, code,
test and debug the solution. At the peak of my performance a few
months back, I was ranked number one in the country and ranked better
than 250 out of 7000+ coders in the world. I have, however, decided to
concentrate on other tasks this summer and therefore will
not participate or practice there till the end of August. The point
is, however, that I went from nothing to being the top in the country
(and 250 in the world) within a matter of 2-3 years, even though I was
competing against international contestants who have been in such
competitions for 5-10 years longer than me.

I have also had to work on real-life projects when I used to do
lightweight freelancing work on website backend development. There I
coded server-side scripts in popular languages like ASP, PHP etc. My
work ranged from simple order-placement scripts to payment gateway
authentication integration. During this time my skills at understanding
and integrating with existing codebases were put to the test. Small
projects frequently involved delivery within weeks, which meant I had
to be fast in getting acquainted with the existing systems in place
and in getting familiar with new technologies. I worked on such small
projects for approximately 2 years.

Over the years my hobbies included work in C, C++, VB, 80x86 Assembly,
PHP, ASP and Java, among other languages. In the course of learning
several languages I have tested my abilities to learn fast and
thoroughly more than once.



---- Proposal Ends --------

Re: Undergrad stud interested in GSoC

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Saturday 29 March 2008, Samee Zahur wrote:
> Being an undergrad student interested in the field of data-intensive machine
> learning techniques and applications, I am interested in implementing these
> algorithms as a way of getting an exposure into this field.

Great. Nice to have you here.

> Even though I have already applied to work in this SoC, I do have one
> question though. When coding, how am I expected to test the
> effectiveness of my algorithms without running it on a multicore
> platform?

I think the first step should be to make sure your algorithms run correctly in a 
single core environment but on top of Hadoop. Currently we do not yet have 
the infrastructure to test the runtime performance of the implementations on 
large clusters.
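
A minimal sketch of that first step, assuming the summation form of LWLR
from the NIPS paper (each map task accumulates partial sums of
A = sum(w_i * x_i * x_i^T) and b = sum(w_i * x_i * y_i), and the reducer
adds them up before solving A * theta = b). The class and method names
below are hypothetical and use plain Java rather than the Hadoop API; the
one-dimensional input with an intercept term and the Gaussian weighting
with bandwidth tau are also assumptions made only to keep the example
short. The point is that the map and reduce logic can be checked in a
single process before it is wired into a job:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical single-process sketch; names are illustrative, not Mahout or Hadoop API.
class LwlrSketch {

    // Partial sums of A = sum(w * x * x^T) and b = sum(w * x * y) for features (1, x).
    static final class Partial {
        double a00, a01, a11, b0, b1;

        void add(Partial o) {
            a00 += o.a00; a01 += o.a01; a11 += o.a11;
            b0 += o.b0; b1 += o.b1;
        }
    }

    // "Map" step: accumulate weighted sums over one slice [from, to) of the data.
    static Partial map(double[] xs, double[] ys, int from, int to, double query, double tau) {
        Partial p = new Partial();
        for (int i = from; i < to; i++) {
            double d = xs[i] - query;
            double w = Math.exp(-d * d / (2 * tau * tau)); // Gaussian kernel weight
            p.a00 += w;
            p.a01 += w * xs[i];
            p.a11 += w * xs[i] * xs[i];
            p.b0 += w * ys[i];
            p.b1 += w * xs[i] * ys[i];
        }
        return p;
    }

    // "Reduce" step: add up the partial sums produced by all map calls.
    static Partial reduce(List<Partial> parts) {
        Partial total = new Partial();
        for (Partial p : parts) {
            total.add(p);
        }
        return total;
    }

    // Solve the 2x2 system A * theta = b by Cramer's rule and predict at the query point.
    static double predict(Partial s, double query) {
        double det = s.a00 * s.a11 - s.a01 * s.a01;
        double theta0 = (s.b0 * s.a11 - s.a01 * s.b1) / det;
        double theta1 = (s.a00 * s.b1 - s.a01 * s.b0) / det;
        return theta0 + theta1 * query;
    }

    // Split the data into roughly equal slices, map each slice, then reduce and predict.
    static double run(double[] xs, double[] ys, int splits, double query, double tau) {
        int n = xs.length;
        int chunk = (n + splits - 1) / splits;
        List<Partial> parts = new ArrayList<>();
        for (int from = 0; from < n; from += chunk) {
            parts.add(map(xs, ys, from, Math.min(n, from + chunk), query, tau));
        }
        return predict(reduce(parts), query);
    }

    public static void main(String[] args) {
        double[] xs = {0, 1, 2, 3, 4, 5};
        double[] ys = {1, 3, 5, 7, 9, 11}; // exactly y = 2x + 1
        System.out.println(run(xs, ys, 1, 2.5, 1.0)); // one "map task"
        System.out.println(run(xs, ys, 3, 2.5, 1.0)); // three "map tasks"
    }
}
```

Running with one slice and with several slices should give the same
prediction up to rounding, which is the property worth asserting in a unit
test before looking at runtime performance on a cluster.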

> What is the usual development platform used here?

I guess every developer uses the platform that best fits his needs. I myself 
am currently developing on a black MacBook that is running Debian Etch. As an 
IDE I am still using Eclipse; for svn and ant the command line tools are 
sufficient for me.

> And about the application for SoC, I selected the ASF as the mentoring
> organisation - how do I make sure that someone from mahout reviews it?

Those mahout developers that volunteered to mentor students will see your 
application automatically in the Google web app, don't worry about it. Yet it 
would be nice if you posted your application to the list as well, that way 
people who have no time for mentoring can still discuss your proposal and 
contribute that way.

> Initially for the SoC, I want to implement LWLR, NN and PCA,

Sounds like quite a bit of work for this summer. Are you sure you can 
successfully implement all three algorithms?

> but later beyond the GSoC I want to continue contributing to this project in
> other ways I figure out once I gain familiarity with the scope of this
> project.

Good to hear. Grant already sent a posting on contributing even w/o Google 
funding. I would very much like to see students that remain in the project 
after the GSoC is over.

Isabel

-- 
It got to the point where I had to get a haircut or both feet firmly planted 
in the air.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Undergrad stud interested in GSoC

Posted by Karl Wettin <ka...@gmail.com>.

Samee Zahur skrev:
> Hello,

Hi Samee.

> I have read through nips paper and the march archive for this mailing
> list, and I feel I can implement some of the algorithms (as permitted
> by time) described in the nips paper. Being an undergrad student
> interested in the field of data-intensive machine learning techniques
> and applications, I am interested in implementing these algorithms as
> a way of getting an exposure into this field.
> 
> Even though I have already applied to work in this SoC, I do have one
> question though. When coding, how am I expected to test the
> effectiveness of my algorithms without running it on a multicore
> platform? Or do I simply assume that a sufficiently sensible
> application of M/R will allow Hadoop to take care of scalability?

The effectiveness is not as interesting as the scalability. M/R is a 
good fit if your job can be divided into more map tasks than you have 
processor cores in your computer.

I have heard of reports stating that in some cases up to 30% of the 
resources are wasted compared to non-M/R operations on a single 
computer. But if you can run it on 2000 nodes using M/R, well..


> What is the usual development platform used here? Sorry if such questions
> seem a bit silly, but it is in order to gain experience in such
> development that I want to work on this project.

Platform as in source code editor?

> And about the application for SoC, I selected the ASF as the mentoring
> organisation - how do I make sure that someone from mahout reviews it?

Post it here.

> Initially for the SoC, I want to implement LWLR, NN and PCA, but later
> beyond the GSoC I want to continue contributing to this project in
> other ways I figure out once I gain familiarity with the scope of this
> project.

This is great to hear! Glad to have you here!


      karl

Re: Undergrad stud interested in GSoC

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Monday 31 March 2008, Jeff Eastman wrote:
> I think we can refer to external datasets in our documentation and load
> them on demand when we run against them.  That way we do not have to store
> them either.

So I guess we should just come up with a list of datasets that are interesting 
to us.

I think that by working with external datasets we might gain more experience 
for working on the JIRA issue Mahout-8 (the data definition model for Mahout)?

Isabel

-- 
BOFH excuse #371:Incorrectly configured static routes on the corerouters.
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Undergrad stud interested in GSoC

Posted by Jeff Eastman <je...@windwardsolutions.com>.

I think we can refer to external datasets in our documentation and load them
on demand when we run against them.  That way we do not have to store them
either.

Jeff

Jeff Eastman, Ph.D.
Windward Solutions Inc.
+1.415.298.0023
http://windwardsolutions.com
http://jeffeastman.blogspot.com
 


Re: Undergrad stud interested in GSoC

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Sunday 30 March 2008, Jeff Eastman wrote:
> I'm working with my colleagues at CollabNet who have expressed interest in
> providing us some EC2 time for this sort of testing.

Sounds great to me.


> They are working on EC2 deployment of Hadoop using their CUBiT machine
> allocation environment and the quid pro quo would be that we help them
> exercise this tool.

I think that's only fair, for us getting some real world experimental results.


> I also think it is important that we build up some example datasets to test
> our existing code

+1 I don't know the exact licensing of existing datasets, maybe we could use 
them for checking as long as we do not ship them as part of our code/project?

Isabel



-- 
[FORTRAN] will persist for some time -- probably for at least the next 
decade.		-- T. Cheatham
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

RE: Undergrad stud interested in GSoC

Posted by Jeff Eastman <je...@windwardsolutions.com>.

I'm working with my colleagues at CollabNet who have expressed interest in
providing us some EC2 time for this sort of testing. They are working on EC2
deployment of Hadoop using their CUBiT machine allocation environment and
the quid pro quo would be that we help them exercise this tool. We have not
yet worked out the administrative details but I will keep the list posted on
progress.

I also think it is important that we build up some example datasets to test
our existing code and I intend to devote some energy to this beginning soon.

Jeff

Jeff Eastman, Ph.D.
Windward Solutions Inc.
+1.415.298.0023
http://windwardsolutions.com
http://jeffeastman.blogspot.com
 
