Posted to common-user@hadoop.apache.org by Robert Evans <ev...@yahoo-inc.com> on 2013/02/07 16:13:26 UTC
Re: More information regarding the Project suggestions given on the Hadoop website

This conversation is probably better suited to common-user@, so I am moving
it over there and have put common-dev@ in the BCC.

I am not really sure what you mean by validate.  I assume you want to test
that your library does what you want it to do.  I would start out with
unit tests to validate that the individual pieces work as you designed
them to.  After that you want to do some system-level testing.  When I
port an algorithm over to Hadoop I typically have one of two goals:
either I want to reproduce the original algorithm exactly, or I want to
create a good enough approximation of it that is extremely scalable.

If you recreated the algorithm exactly, you can validate it against the
single-machine reference implementation and check that the results are
identical.  With machine learning this is often difficult because many
algorithms use random numbers as part of the process.  To get around this
you sometimes have to modify both implementations so that they use a
consistent set of pseudo-random numbers.
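
To make the pseudo-random point concrete, here is a minimal sketch in
Java.  Everything in it is hypothetical: the class, the method names, and
the bit-flip mutation step are stand-ins, not code from Mahout or Commons
Math.  It assumes each individual gets its own generator derived from
(seed, index), so the result does not depend on the order in which
individuals are processed.

```java
import java.util.Arrays;
import java.util.Random;

class SeededMutationCheck {
    // Hypothetical mutation step: flip each gene with probability `rate`.
    // The generator is derived from (seed, index), an assumed scheme, so an
    // individual's outcome is independent of processing order.
    static boolean[] mutate(boolean[] genes, long seed, int index, double rate) {
        Random rng = new Random(seed * 31 + index);
        boolean[] out = genes.clone();
        for (int i = 0; i < out.length; i++) {
            if (rng.nextDouble() < rate) {
                out[i] = !out[i];
            }
        }
        return out;
    }

    // Reference implementation: walks the population in order.
    static boolean[][] reference(boolean[][] pop, long seed, double rate) {
        boolean[][] next = new boolean[pop.length][];
        for (int i = 0; i < pop.length; i++) {
            next[i] = mutate(pop[i], seed, i, rate);
        }
        return next;
    }

    // Stand-in for a distributed port: visits individuals in reverse order,
    // the way separate tasks might finish in an arbitrary order.
    static boolean[][] distributedStandIn(boolean[][] pop, long seed, double rate) {
        boolean[][] next = new boolean[pop.length][];
        for (int i = pop.length - 1; i >= 0; i--) {
            next[i] = mutate(pop[i], seed, i, rate);
        }
        return next;
    }

    // Builds a reproducible random population for the comparison.
    static boolean[][] samplePopulation(int size, int genes, long seed) {
        Random rng = new Random(seed);
        boolean[][] pop = new boolean[size][genes];
        for (boolean[] individual : pop) {
            for (int g = 0; g < genes; g++) {
                individual[g] = rng.nextBoolean();
            }
        }
        return pop;
    }

    public static void main(String[] args) {
        boolean[][] pop = samplePopulation(8, 16, 1L);
        boolean same = Arrays.deepEquals(
                reference(pop, 42L, 0.1),
                distributedStandIn(pop, 42L, 0.1));
        System.out.println("identical: " + same);
    }
}
```

If both implementations shared a single generator instead, the results
would depend on the order of the draws, which a distributed run usually
cannot guarantee.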

The other alternative is to use statistics, and this works fairly well no
matter how you ported the algorithm.  Train on the same input data
multiple times with each implementation, then compare the results against
a test set.  As grad students you probably already understand the stats
necessary to do this correctly.  Your advisor will probably be able to
give you better advice on this too, because they can sit down with you and
give you much faster feedback.
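
As one possible way to do the comparison, here is a rough sketch that
computes Welch's t statistic over the test-set scores from repeated runs
of each implementation.  The choice of Welch's test is my assumption, not
the only reasonable one, and the class name and the numbers in main are
placeholders for illustration, not measured results.

```java
class WelchComparison {
    // Arithmetic mean of the scores.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Unbiased sample variance (divides by n - 1); needs at least two values.
    static double sampleVariance(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / (xs.length - 1);
    }

    // Welch's t statistic for two samples with possibly unequal variances.
    // Returns NaN or an infinity if both samples have zero variance.
    static double welchT(double[] a, double[] b) {
        double standardError = Math.sqrt(
                sampleVariance(a) / a.length + sampleVariance(b) / b.length);
        return (mean(a) - mean(b)) / standardError;
    }

    public static void main(String[] args) {
        // Placeholder scores, one per training run, for illustration only.
        double[] referenceScores = {0.81, 0.79, 0.83, 0.80, 0.82};
        double[] hadoopScores = {0.80, 0.78, 0.84, 0.79, 0.81};
        System.out.printf("Welch t = %.3f%n",
                welchT(referenceScores, hadoopScores));
    }
}
```

A t value near zero suggests the two implementations score about the
same on average.  Turning it into a p-value needs the t distribution,
which I left out to keep the sketch free of dependencies.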

--Bobby

On 2/7/13 12:55 AM, "Varsha Raveendran" <va...@gmail.com>
wrote:

>Hello!
>
>
>Based on a couple of existing genetic algorithm libraries available on the
>net, my team and I have come up with a design for the library. But we are
>not able to understand how to validate the library.
>
>Are there any test designs that are followed to check whether a library is
>working correctly?
>
>
>I would like to again mention that we are graduate students and have just
>started working on Hadoop.
>
>Thanks in advance,
>Varsha
>
>
>
>On Sat, Jan 19, 2013 at 9:42 AM, Varsha Raveendran <
>varsha.raveendran@gmail.com> wrote:
>
>> Thank you! I will check with the Mahout team and also go through Commons
>> Math site.
>>
>> Thanks & Regards,
>> Varsha
>>
>>
>> On Sat, Jan 19, 2013 at 12:16 AM, Robert Evans
>> <ev...@yahoo-inc.com> wrote:
>>
>>> I'm not sure I am exactly the right person for this, but I assume that
>>> you are familiar with genetic algorithms.  The Mahout Project is
>>> probably a good place to start http://mahout.apache.org/ they have a
>>> number of machine learning algorithms that run on top of Hadoop.  I did
>>> a search and it looks like there may already be some support for them
>>> in Mahout, but I don't know the current state of it.  It looked like
>>> there was some discussion about it being abandoned and might be
>>> deleted.  Either way it would be a good starting point.  Commons Math
>>> may be a good place to look too because there is an implementation
>>> there that is already Apache licensed. So if you borrow some of the
>>> code there is no issue
>>> http://commons.apache.org/math/userguide/genetics.html.
>>>
>>> --Bobby Evans
>>>
>>> On 1/16/13 8:24 AM, "Varsha Raveendran" <va...@gmail.com>
>>> wrote:
>>>
>>> >Hello!
>>> >
>>> >I require information regarding a project given on the Hadoop
>>> >website. Can anyone guide me in the right direction?
>>> >
>>> >The project is "Implement a library/framework to support Genetic
>>> >Algorithms <http://en.wikipedia.org/wiki/Genetic_algorithm> on Hadoop
>>> >Map-Reduce."
>>> >
>>> >
>>> >Regards,
>>> >Varsha
>>> >
>>> >New to Hadoop :)
>>>
>>>
>>
>>
>> --
>> *-Varsha *
>>
>
>
>
>-- 
>*-Varsha *