Posted to dev@mahout.apache.org by satyam sinha <su...@gmail.com> on 2013/04/05 18:24:56 UTC

GSOC 2013 Aspirant on #MAHOUT-1177 and #MAHOUT-1179

Hello all,
I am Satyam, a Computer Science student with a great interest in scalability
and machine learning, and a strong flair for Java programming. I recently
discovered the Apache technologies in these areas, Hadoop and Mahout, while
browsing Google I/O videos on YouTube.
I aspire to get involved in these projects. Incidentally, Mahout has a
strong presence in GSOC, which has me interested.
I have been digging through the Mahout mailing-list archives and stumbled
upon the threads that seem to be the origin of interest in the JIRA issues:
http://find.searchhub.org/document/d8a473db89c7b99e
I do not have any experience with Hadoop and am working towards building a
sound understanding of this technology.
I consider myself a self-starter. I am already digging through the
Mahout source code and trying to get familiar with it. Please give
directions and suggestions to help me on my very first FOSS experience.
I hope to become a part of this community and get a chance to
contribute to these amazing projects.

Regards,
Satyam

Re: GSOC 2013 Aspirant on #MAHOUT-1177 and #MAHOUT-1179

Posted by satyam sinha <su...@gmail.com>.
Review request:
I've submitted a very general proposal to the ASF.
Is there some way I can confirm that it has been routed and delivered to
Mahout?

The proposal is as follows. Any advice is appreciated. (Perhaps I should
have provided a link instead?)

*Short description:* The main goal of this project is to refactor for
performance and ease of use, based on the Mahout API design decided by the
community. Additionally, provide infographic-based documentation, and add or
redesign tests, examples, and benchmarks.

*Problem Description*

There is a need for restructuring the Mahout API to provide streamlined
input and output formats, and an intuitive structuring of the class and
project hierarchy. Several related projects may need a common interface
with regularized prototypes. We need to redesign several tests, benchmarks
and examples; and to add these in case they are not present.



*Deliverables*

   - Clean and optimized API
   - Documentation with info-graphics and dependency charts.
   - New tests and benchmarks



*Design Document*

The design of the new API is expected to emerge well before the coding
phase starts, based on ongoing discussions on the mailing lists. I will set
up a wiki that allows easy access and exchange of opinions on the design.

I am a huge believer in infographics and will include the design graphic
and dependency graph in the Mahout documentation.



*Approach*

Largely IDE-based development with the help of integrated tools. I intend to
resort to the CLI for writing and editing scripts.



*Timeline*

The summer break is on, so I am essentially free until mid-July. I have a
lot of time on my hands that I can devote to the project, and I can
commit to over 40 hours every week. Regular classes resume thereafter
(which are no hindrance).



*Pre-Coding Phase:<3 weeks : 5 May - 26 May >*

Address a few PMD, FindBugs, Checkstyle, and open-task items on Jenkins to
gain familiarity with the code base and associated tools. Meanwhile, create
the refactoring roadmap based on open discussion in the Mahout community.



*Phase 1:<3 weeks : 27 May - 16 June >*

Restructure the code base to the new API design. Provide regression testing
and redesign tests where required.



*Review 1:<1 week : 17 June - 23 June >*

Update the Mahout wiki and the documentation. Provide and run
diagnostics. Also document the tests and examples for beginners. Analyze
profiler reports and look for bottlenecks.



*Phase 2:<2 weeks : 24 June - 7 July >*

Write tests, examples, and benchmarks. Address community feedback on the
work in Phase 1.



*Review 2:<1 week : 8 July - 14 July >*

End-to-end testing. Fix outstanding bugs. Report on performance improvement.



*Phase 3:<4 weeks : 15 July - 12 July >*

I hope to have built a very good foundation by now. Recommence integrated
development with concurrent testing and documentation. Resolve related JIRA
issues.



*Beyond GSOC:*

Remain associated with Mahout. Work towards becoming a committer.




Due to the nature of the project, the timeline may be subject to change to
reflect variations in the roadmap. I am also open to tasks that my
mentor may see fit to assign me.

*References:*
https://issues.apache.org/jira/browse/MAHOUT-1177
https://issues.apache.org/jira/browse/MAHOUT-1179



*About Me*

I am an undergraduate student about to start the final year of the
4-year programme in Computer Science and Engineering at Birla Institute
of Technology, Mesra, India. I have a solid background in statistics,
object-oriented programming, and system architecture.

I endeavor to build a career in scalable data science. I have developed a
preliminary understanding of the Hadoop and Mahout APIs and hope to build
upon this knowledge as we progress through GSOC.



This is my first experience with open source and I will surely give my
best.


On Fri, May 3, 2013 at 5:59 AM, satyam sinha <su...@gmail.com> wrote:

> I lost a lot of time due to semester evaluations at my institute. (Perhaps I
> should have notified you.)
> The summer break has begun and I now have uninterrupted time to devote to
> GSOC.
>
> I have already set up hadoop-1.0.4 on opensuse-12.3, and
> Mahout 0.8-SNAPSHOT via svn on netbeans-7.3.
> I've been running the various examples and tests included.
> It took me almost a week (okay, I'm not a wizard!! :) ) to set up and go
> through various talks and slides.
> I need some insight on whether it is advisable to look into Avro now.
>
> TL;DR
> I was away for college, but am back now full-time.
> (May this not reflect badly upon me.)
> I will set up the wiki with my initial ideas in under 24 hours, so that we
> can all discuss the needs of the API.

Re: GSOC 2013 Aspirant on #MAHOUT-1177 and #MAHOUT-1179

Posted by satyam sinha <su...@gmail.com>.
I lost a lot of time due to semester evaluations at my institute. (Perhaps I
should have notified you.)
The summer break has begun and I now have uninterrupted time to devote to
GSOC.

I have already set up hadoop-1.0.4 on opensuse-12.3, and
Mahout 0.8-SNAPSHOT via svn on netbeans-7.3.
I've been running the various examples and tests included.
It took me almost a week (okay, I'm not a wizard!! :) ) to set up and go
through various talks and slides.
I need some insight on whether it is advisable to look into Avro now.

TL;DR
I was away for college, but am back now full-time.
(May this not reflect badly upon me.)
I will set up the wiki with my initial ideas in under 24 hours, so that we can
all discuss the needs of the API.


On Mon, Apr 8, 2013 at 12:23 PM, Isabel Drost-Fromm <is...@apache.org> wrote:

>
> Hi Satyam,
>
> On Friday, April 05, 2013 09:54:56 PM satyam sinha wrote:
> > Please give directions and suggestions to help me on my very first FOSS
> > experience.
>
> I guess the best way to get started is to check out the source code, build
> the project and get familiar with the code. Both issues you mention in the
> subject are good for people with less experience in machine learning and/or
> Hadoop.
>
> However both are pretty involved - you will need to understand the existing
> code, come up with a good design for new APIs and discuss that design with
> the community. So best to concentrate on just one of them.
>
> Feel free to also create a separate wiki page that contains a living design
> document for the APIs that others can contribute to as well.
>
>
> Isabel
>

Re: GSOC 2013 Aspirant on #MAHOUT-1177 and #MAHOUT-1179

Posted by Isabel Drost-Fromm <is...@apache.org>.
Hi Satyam,

On Friday, April 05, 2013 09:54:56 PM satyam sinha wrote:
> Please give directions and suggestions to help me on my very first FOSS
> experience.

I guess the best way to get started is to check out the source code, build the
project and get familiar with the code. Both issues you mention in the subject
are good for people with less experience in machine learning and/or Hadoop.

However both are pretty involved - you will need to understand the existing 
code, come up with a good design for new APIs and discuss that design with the 
community. So best to concentrate on just one of them.

Feel free to also create a separate wiki page that contains a living design 
document for the APIs that others can contribute to as well.


Isabel