Posted to dev@spark.apache.org by Tarek Elgamal <ta...@gmail.com> on 2015/05/18 12:13:05 UTC

Contribute code to MLlib

Hi,

I would like to contribute an algorithm to the MLlib project. I have
implemented a scalable PCA algorithm on Spark. It is scalable for both tall
and fat matrices, and the accompanying paper has been accepted for
publication at the SIGMOD 2015 conference. I looked at the guidelines at
the following link:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

I believe that most of the guidelines apply in my case; however, the code
is written in Java, and it was not clear from the guidelines whether the
MLlib project accepts Java code or not.
My algorithm can be found in this repository:
https://github.com/Qatar-Computing-Research-Institute/sPCA

Any help on making it suitable for the MLlib project would be greatly
appreciated.

Best Regards,
Tarek Elgamal

Re: Contribute code to MLlib

Posted by Trevor Grant <tr...@gmail.com>.
Thank you, Ram and Joseph.

I am also hoping to contribute to MLlib once my Scala is up to snuff; this
is the guidance I needed for how to proceed when ready.

Best wishes,
Trevor

Re: Contribute code to MLlib

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Trevor,

I may be repeating what Ram said, but to second it, a few points:

We do want MLlib to become an extensive and rich ML library; as you said,
scikit-learn is a great example.  To make that happen, we of course need to
include important algorithms.  "Important" is hazy, but roughly means being
useful to a large number of users, improving a large number of use cases
(above what is currently available), and being well-established and tested.

Others and I may not be familiar with Tarek's algorithm (since it is so
new), so it will be important to discuss details on JIRA to establish the
cases in which the algorithm improves over the current PCA.  That may require
discussion, community testing, etc.  If we establish that it is a clear
improvement in a large domain, then it could be valuable to have in MLlib
proper.  It's always going to be hard to tell where to draw the line, so
less common algorithms will require more testing before we commit to
including them in MLlib.

I like the Spark package suggestion, since it would allow users to start
using the code immediately while the discussion on JIRA happens.  (Plus, if
package users find it useful, they can report that on the JIRA.)

Joseph

Re: Contribute code to MLlib

Posted by Ram Sriharsha <sr...@gmail.com>.
Hi Trevor

I'm attaching the MLlib contribution guideline here:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

It speaks to widely known and accepted algorithms, but not to whether an
algorithm has to be better than another in every scenario, etc.

I think the guideline explains what a good contribution to the core
library should look like better than I initially attempted to!

Sent from my iPhone

Re: Contribute code to MLlib

Posted by Ram Sriharsha <sr...@gmail.com>.
Hi Trevor

Good point; I didn't mean that an algorithm has to be clearly better than
another in every scenario to be included in MLlib. However, even if someone
is willing to be the maintainer of a piece of code, it does not make sense
to accept every possible algorithm into the core library.

That said, the specific algorithms should be discussed in the JIRA: as you
point out, there is no clear way to decide which algorithms to include and
which not to. Usually, mature algorithms that serve a wide variety of
scenarios are easier to argue for, but nothing prevents anyone from
opening a ticket to discuss any specific machine learning algorithm.

My suggestion was simply that, for the purpose of making experimental or
newer algorithms available to Spark users, the code doesn't necessarily
have to be in the core library. Spark packages are good enough in this
respect.

Isn't it better for newer algorithms to take this route and prove
themselves before we bring them into the core library? Especially given
that the barrier to using Spark packages is very low.

Ram

Re: Contribute code to MLlib

Posted by Trevor Grant <tr...@gmail.com>.
Hey Ram,

I'm not speaking to Tarek's package specifically but to the spirit of
MLlib.  There are a number of methods/algorithms for PCA, and I'm not sure
by what criterion the current one is considered 'standard'.

It is rare to find ANY machine learning algorithm that is 'clearly better'
than any other.  They are all tools; each has its place and time.  I agree
that it makes sense to field new algorithms as packages and then integrate
them into MLlib once they are 'proven' (in terms of stability, performance,
and whether anyone cares).  That being said, if MLlib takes the stance that
'what we have is good enough unless something is *clearly* better', then it
will never grow into a suite with the depth and richness of sklearn.  From
a practitioner's standpoint, it's nice to have everything I could ever want
ready in an 'off-the-shelf' form.

'Better than the existing implementation for a large number of use cases'
shouldn't be the criterion for selecting what to include in MLlib.  The
important question should be, 'Are you willing to take on responsibility
for maintaining this, because you may be the only person on earth who
understands the mechanics AND how to code it?'  Obviously we don't want
any random junk algorithm included.  But trying to say 'this way of doing
PCA is better than that way in a large class of cases' is like trying to
say 'geometry is more important than calculus in a large class of cases':
maybe it's true, but geometry won't help you if you are in a case where
you need calculus.

This all relies on the assumption that MLlib is destined to be a rich data
science/machine learning package.  It may be that the goal is to keep the
project as lightweight and parsimonious as possible; if so, excuse me for
speaking out of turn.

Re: Contribute code to MLlib

Posted by Ram Sriharsha <sr...@gmail.com>.
Hi Trevor, Tarek

You can make non-standard algorithms (PCA or otherwise) available to users
of Spark as Spark packages:
http://spark-packages.org
https://databricks.com/blog/2014/12/22/announcing-spark-packages.html

With the availability of Spark packages, adding powerful experimental or
alternative machine learning algorithms to the pipeline has never been
easier.  I would suggest that route in scenarios where a machine learning
algorithm is not clearly better than an existing MLlib implementation in
the common cases.
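
For concreteness, once an implementation is published as a package,
pulling it into a session is a one-liner; the coordinates below are
hypothetical placeholders (the real ones are listed on spark-packages.org):

  # hypothetical coordinates, shown only to illustrate the mechanics
  $SPARK_HOME/bin/spark-shell --packages com.example:spca_2.10:0.1.0

The --packages flag (also accepted by spark-submit) resolves the artifact
from Maven and puts it on the classpath, so users can try the algorithm
without building anything themselves.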

If your algorithm is better than the existing PCA implementation for a
large class of use cases, then we should open a JIRA and discuss the
relative strengths/weaknesses (perhaps with some benchmarks) so we can
better understand whether it makes sense to switch out the existing PCA
implementation and make yours the default.

Ram

Re: Contribute code to MLlib

Posted by Trevor Grant <tr...@gmail.com>.
There are most likely advantages and disadvantages to Tarek's algorithm
relative to the current implementation, and different scenarios where each
is more appropriate.

Would we not offer multiple PCA algorithms and let the user choose?

Trevor

Trevor Grant
Data Scientist

*"Fortunate is he, who is able to know the causes of things."  -Virgil*


On Mon, May 18, 2015 at 4:18 PM, Joseph Bradley <jo...@databricks.com>
wrote:

> Hi Tarek,
>
> Thanks for your interest & for checking the guidelines first!  On 2 points:
>
> Algorithm: PCA is of course a critical algorithm.  The main question is
> how your algorithm/implementation differs from the current PCA.  If it's
> different and potentially better, I'd recommend opening up a JIRA for
> explaining & discussing it.
>
> Java/Scala: We really do require that algorithms be in Scala, for the sake
> of maintainability.  The conversion should be doable if you're willing
> since Scala is a pretty friendly language.  If you create the JIRA, you
> could also ask for help there to see if someone can collaborate with you to
> convert the code to Scala.
>
> Thanks!
> Joseph
>
> On Mon, May 18, 2015 at 3:13 AM, Tarek Elgamal <ta...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I would like to contribute an algorithm to the MLlib project. I have
>> implemented a scalable PCA algorithm on spark. It is scalable for both tall
>> and fat matrices and the paper around it is accepted for publication in
>> SIGMOD 2015 conference. I looked at the guidelines in the following link:
>>
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines
>>
>> I believe that most of the guidelines applies in my case, however, the
>> code is written in java and it was not clear in the guidelines whether
>> MLLib project accepts java code or not.
>> My algorithm can be found under this repository:
>> https://github.com/Qatar-Computing-Research-Institute/sPCA
>>
>> Any help on how to make it suitable for MLlib project will be greatly
>> appreciated.
>>
>> Best Regards,
>> Tarek Elgamal
>>
>>
>>
>>
>

Re: Contribute code to MLlib

Posted by Joseph Bradley <jo...@databricks.com>.
Hi Tarek,

Thanks for your interest & for checking the guidelines first!  Two points:

Algorithm: PCA is of course a critical algorithm.  The main question is how
your algorithm/implementation differs from the current PCA.  If it's
different and potentially better, I'd recommend opening up a JIRA for
explaining & discussing it.
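
For reference, the current PCA lives on RowMatrix in mllib; here is a
minimal sketch of the code path a comparison would be made against (toy
data, Spark 1.3-era API):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.linalg.distributed.RowMatrix
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("pca-sketch").setMaster("local[*]"))

  // Toy data: each row is an observation, each column a feature.
  val rows = sc.parallelize(Seq(
    Vectors.dense(1.0, 2.0, 3.0),
    Vectors.dense(4.0, 5.0, 6.0),
    Vectors.dense(7.0, 8.0, 10.0)))
  val mat = new RowMatrix(rows)

  // Top 2 principal components, returned as a local n x 2 matrix
  // (n = number of columns).  The current implementation assembles the
  // n x n covariance matrix on the driver, which is why it handles tall,
  // skinny matrices well but struggles with fat ones.
  val pc = mat.computePrincipalComponents(2)

  // Project the rows into the 2-dimensional principal subspace.
  val projected = mat.multiply(pc)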

Java/Scala: We really do require that algorithms be in Scala, for the sake
of maintainability.  The conversion should be doable if you're willing
since Scala is a pretty friendly language.  If you create the JIRA, you
could also ask for help there to see if someone can collaborate with you to
convert the code to Scala.
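
To give a feel for the conversion, here is a hypothetical fragment (not
taken from the sPCA code) in both languages:

  // Java: squared L2 norm of a vector
  double sum = 0.0;
  for (double v : values) {
      sum += v * v;
  }

  // Scala equivalent
  val sum = values.map(v => v * v).sum

Most of the effort is this kind of mechanical translation, plus matching
the existing MLlib APIs and style.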

Thanks!
Joseph