Posted to dev@mahout.apache.org by Farid Bourennani <Fa...@uoit.ca> on 2008/04/04 17:03:17 UTC

FW: Google Summer of Code

Farid:
On Thursday 03 April 2008, Farid Bourennani wrote:
> > 3) any additional tools (such as a GUI) required to be developed to prove
> > my implementation?
>
> By GUI, I meant plotting tools in order to be able to visualize every
> iteration of the implemented machine learning algorithm and validate the
> final results graphically (eg. Gaussian VS Random Data).

Isabel Drost:
There should be some automated means of validating your results that does not
need human intervention. Where possible your algorithms should come with unit
tests to prove that they work.
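
As a minimal sketch of what such an automated check could look like (an
illustration only, not Mahout code): generate data whose structure is known in
advance, run the algorithm, and assert that it recovers that structure. The
function names, the toy 1-D two-means clusterer, the seed and the tolerance
below are all assumptions made for the example.

```python
import random


def make_gaussian_data(seed=42, n=200):
    """Two well-separated 1-D Gaussian clusters with known means 0 and 10."""
    rng = random.Random(seed)  # fixed seed keeps the test reproducible
    return ([rng.gauss(0.0, 1.0) for _ in range(n)]
            + [rng.gauss(10.0, 1.0) for _ in range(n)])


def two_means_1d(points, iterations=20):
    """Toy k-means with k=2 on 1-D data (hypothetical stand-in algorithm)."""
    low, high = min(points), max(points)  # initial centers: the extremes
    for _ in range(iterations):
        left = [p for p in points if abs(p - low) <= abs(p - high)]
        right = [p for p in points if abs(p - low) > abs(p - high)]
        if not right:  # degenerate input: all points collapse to one center
            break
        low = sum(left) / len(left)
        high = sum(right) / len(right)
    return low, high


def test_recovers_gaussian_means(tolerance=0.5):
    """Passes only if the found centers are close to the true means."""
    low, high = two_means_1d(make_gaussian_data())
    assert abs(low - 0.0) < tolerance
    assert abs(high - 10.0) < tolerance


if __name__ == "__main__":
    test_recovers_gaussian_means()
```

The same pattern replaces visual inspection of a plot: the "Gaussian vs random
data" comparison becomes an assertion on a number, so it can run unattended
with every build.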

Farid:
> 5) It was also mentioned on the project that "Students are also encouraged
> to work on projects related to their own machine learning research". Does
> that mean that all the algorithms used have to be posted right away?

Isabel Drost:
Well, I think you should make available all code and libraries that you use in
a way that is compatible with both the Apache Software License, under which the
code you develop during your project will be licensed, and the licenses of the
libraries you want to use.

That said you need to make available everything that is necessary for your
code to work correctly. It does not make a lot of sense to me to include
some Java module that one can only use if one owns a Matlab license. Or
worse, that only works with a library that is only available to your research
lab. But I guess, that was clear to you already ;)

Farid (NEW QUESTION)

I understand that the complete code must be published; no doubt about it! Please note that the Lucene-Mahout project is very close to my research thesis, so I am aiming for a possible publication on some hybrid learning algorithms. Please correct me if I am wrong: my understanding is that the implemented algorithm is entirely the property of Apache, and I would be very happy to contribute to the community. That being said, are the publications related to the hybrid machine learning algorithms still the property of the university? I am not talking about the code here, only about the publication. The reason for my question is that I am new to the open-source world as well as to the publication world: it's very exciting! I only wanted to clarify everything before, very hopefully, starting.

Farid:
> 6) I assume that we will be using Lucene? Even though the learning
> algorithms can be used for different applications (images, speech
> recognition ...), I am more interested in text algorithms, especially since
> Lucene offers stemming, stop-word filtering, text normalization and
> even synonym expansion functionality.

Isabel Drost:
I think it should be fine to use Lucene for the preprocessing steps and for
feature extraction. It would be nice, if the algorithm was designed and
implemented general enough to allow others to use it for processing images,
speech or whatever they like - if that is possible and makes sense for your
algorithm.
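
One way to picture that separation, as a rough sketch under stated assumptions:
the learning code only ever sees feature vectors, and the text-specific
preprocessing lives in its own function. Nothing below uses the Lucene API;
`text_features` is a hypothetical stand-in for Lucene-based tokenizing and
stop-word filtering, and the tiny stop-word list is invented for the example.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list only; a real analyzer would supply its own.
STOP_WORDS = {"the", "a", "an", "and", "of", "is"}


def text_features(text):
    """Hypothetical text front end: lowercase, tokenize, drop stop words,
    and return a sparse term-frequency vector as a dict."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(t for t in tokens if t not in STOP_WORDS))


def cosine_similarity(a, b):
    """Data-agnostic part: works on any sparse vectors (dict of name -> weight),
    whether they came from text, image descriptors or speech features."""
    dot = sum(weight * b.get(name, 0.0) for name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = text_features("The cat and the hat")
    doc2 = text_features("A hat of the cat")
    print(cosine_similarity(doc1, doc2))
```

Swapping `text_features` for an image or speech feature extractor would leave
`cosine_similarity`, and any algorithm built on it, unchanged.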

Farid (NEW QUESTION)
That's not an issue; the algorithms usually all use the VSM (vector space model). I have already implemented some learning algorithms in the past in such a way that the machine learning algorithm could be applied to any type of data (image, speech...). However, I only wanted to know whether the use of Lucene is required, suggested, or neither?

Regards,
Farid

Re: FW: Google Summer of Code

Posted by Isabel Drost <ap...@isabel-drost.de>.

On Friday 04 April 2008, Farid Bourennani wrote:
> That being said, are the publications related to the hybrid machine learning
> algorithms still the property of the university?

Apache will not claim these - although citing our project would sure be nice 
if the publication relies on Mahout.

As Ted already said it depends on the publisher of your paper whether the 
copyright will remain yours after publication though.

> I am new to the open-source world as well as to the publication world: it's
> very exciting! I only wanted to clarify everything before, very hopefully,
> starting.

To add a little experience from German universities: As far as I know they are 
pretty open to the idea of publishing code under open source licenses 
although students get support if they want to turn their research into a 
business model. You should certainly speak to your university concerning your 
plan to contribute your software to Mahout as well as your intention to take 
part in Google Summer of Code. I would not expect any problems, but you never
know.

> However, I only wanted to know whether the use of Lucene is required,
> suggested, or neither?

Well, several Mahout people have a Lucene background and are interested in 
text mining. Of course this does not imply that we reject any patches that 
are not Lucene-centric ;)

Isabel

-- 
"Mind if I smoke?"	"Yes, I'd like to see that, does it come out of your ears 
or what?"
  |\      _,,,---,,_       Web:   <http://www.isabel-drost.de>
  /,`.-'`'    -.  ;-;;,_
 |,4-  ) )-,_..;\ (  `'-'
'---''(_/--'  `-'\_) (fL)  IM:  <xm...@spaceboyz.net>

Re: Google Summer of Code

Posted by Ted Dunning <td...@veoh.com>.

Actually, the author is the owner until publication at which point the
journal (usually) takes ownership of the copyright.


On 4/4/08 10:06 AM, "Grant Ingersoll" <gs...@apache.org> wrote:

> Publications are different, and you/University are the owner of that
> and the copyright holder.

Re: Google Summer of Code

Posted by Grant Ingersoll <gs...@apache.org>.

I agree w/ Ted and can add:

By writing the code (for the GSOC), you are donating it to the ASF.   
The ASF has the copyright for it and I think (IANAL) would be  
considered the owner, as the community will no doubt extend it and  
change it.  Having said that, the Apache license is such that it can  
be used by anyone for pretty much any purpose, you just can't say it  
is yours or call it Mahout.   You see this quite a bit, in fact.   
Sun's JavaDB and IBM's Cloudscape (I think they call it that) are just  
Apache Derby, I believe.

And yes, you should check w/ your University.  Some are very closed  
when it comes to Open Source (especially the Apache license).  That  
being said, if you are being paid to do the work for the ASF by Google  
as a summer internship, I don't see how they could lay claims to it as  
their intellectual property.

Publications are different, and you/University are the owner of that
and the copyright holder.  The only way the ASF would be is if you donated
the publication to the ASF (which doesn't make much sense for an
academic paper, but would for a tutorial).  Just make sure
you call it Apache Mahout when referring to the code and project (and
a URL would be great, too!).

This is very cool, though.  One of my biggest hopes for Mahout is that  
it will become something Universities will latch onto for teaching and  
creating and we will attract more and more students.

-Grant

On Apr 4, 2008, at 6:41 PM, Ted Dunning wrote:

> Apache doesn't have to be the "owner".  They just have to have complete
> rights to create derivative works and redistribute the software.
>
> Your university may have an issue with that.  You should ask them.
>
> You should also check with anybody who is funding your research.  The
> university research officer should be a good place to start with that
> question as well.
>
> Also, be very careful because many people answering your question (like me)
> will be giving you US-centric answers.  Since you are in Canada, the answers
> may be importantly different.

Re: Google Summer of Code

Posted by Ted Dunning <td...@veoh.com>.

Apache doesn't have to be the "owner".  They just have to have complete
rights to create derivative works and redistribute the software.

Your university may have an issue with that.  You should ask them.

You should also check with anybody who is funding your research.  The
university research officer should be a good place to start with that
question as well.

Also, be very careful because many people answering your question (like me)
will be giving you US-centric answers.  Since you are in Canada, the answers
may be importantly different.