Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2016/04/01 09:15:06 UTC

[discuss] using deep learning to improve Spark

Hi all,

Hope you all enjoyed the Tesla 3 unveiling earlier tonight.

I'd like to bring your attention to a project called DeepSpark that we have
been working on for the past three years. We realized that scaling software
development was challenging. A large fraction of software engineering has
been manual and mundane: writing test cases, fixing bugs, implementing
features according to specs, and reviewing pull requests. So we started
this project to see how much we could automate.

After three years of development and one year of testing, we now have
enough confidence that this could work well in practice. For example, Matei
confessed to me today: "It looks like DeepSpark has a better understanding
of Spark internals than I ever will. It updated several pieces of code I
wrote long ago that even I no longer understood."


I think it's time to discuss as a community how we want to continue
this project so that Spark stays stable, secure, and easy to use while
still progressing as fast as possible. I'm still working on a more formal design
doc, and it might take a little bit more time since I haven't been able to
fully grasp DeepSpark's capabilities yet. Based on my understanding right
now, I've written a blog post about DeepSpark here:
https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html


Please take a look and share your thoughts. Obviously, this is an ambitious
project and could take many years to fully implement. One major challenge
is cost. The current Spark Jenkins infrastructure provided by the AMPLab
has only 8 machines, but DeepSpark uses 12,000 machines. I'm not sure
whether AMPLab or Databricks can fund DeepSpark's operation for a long
period of time. Perhaps AWS can help out here. Let me know if you have
other ideas.

Re: [discuss] using deep learning to improve Spark

Posted by Ricardo Almeida <ri...@actnowib.com>.
Amazing! I'll fund $1/2 million for such an interesting initiative.
Oh, wait... I have only $4 in my pocket.

Cheers :)

On 1 April 2016 at 11:40, Takeshi Yamamuro <li...@gmail.com> wrote:

> Oh, the annual event...
>
> On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li <ga...@gmail.com> wrote:
>
>> April 1st... : )
>>
>> 2016-04-01 0:33 GMT-07:00 Michael Malak <mi...@yahoo.com.invalid>:
>>
>>> I see you've been burning the midnight oil.
> --
> ---
> Takeshi Yamamuro
>

Re: [discuss] using deep learning to improve Spark

Posted by Takeshi Yamamuro <li...@gmail.com>.
Oh, the annual event...

On Fri, Apr 1, 2016 at 4:37 PM, Xiao Li <ga...@gmail.com> wrote:

> April 1st... : )
>
> 2016-04-01 0:33 GMT-07:00 Michael Malak <mi...@yahoo.com.invalid>:
>
>> I see you've been burning the midnight oil.


-- 
---
Takeshi Yamamuro

Re: [discuss] using deep learning to improve Spark

Posted by Xiao Li <ga...@gmail.com>.
April 1st... : )

2016-04-01 0:33 GMT-07:00 Michael Malak <mi...@yahoo.com.invalid>:

> I see you've been burning the midnight oil.

Re: [discuss] using deep learning to improve Spark

Posted by Michael Malak <mi...@yahoo.com.INVALID>.
I see you've been burning the midnight oil.

      From: Reynold Xin <rx...@databricks.com>
 To: "dev@spark.apache.org" <de...@spark.apache.org> 
 Sent: Friday, April 1, 2016 1:15 AM
 Subject: [discuss] using deep learning to improve Spark