Posted to dev@nemo.apache.org by 송원욱 <wo...@apache.org> on 2019/07/01 15:40:28 UTC

Back from Berlin Beam Summit 2019

Hi!

I got back from Beam Summit Europe 2019, which took place last week in
Berlin, and I had lots of interesting conversations and feedback from the
people I met there. I thought I would share some of it with the
dev list. By the way, you can check out the talk on YouTube
<https://youtu.be/DKxYE8YWF_o>!

First of all, a lot of people were *very* interested in Apache Nemo, and a
lot of people from the Beam community were very excited to hear about a new
runner with primary support for their language! One reason for their
interest is that Beam does not actually get involved in the runtime layer,
where the actual scheduling, communication, and distributed computation
happen, so they were interested in the optimizations that can be done in
that layer.

Second, with all the support from the TFX team, as well as the Beam SQL
team, supporting the *portability layer* of Beam would bring loads of new
possibilities for Nemo. The portability layer supports applications written
in any of Java, Python, and Go (and more in the future!). It is getting more
and more mature, and I think it's about time for Nemo to support it as well:
not a lot of runners support it so far, so it would give Nemo a head start.

Another thing I noticed is that a lot of people are still much more
interested in *batch* processing than in stream processing. From the people
I talked to, I learned that they found stream processing quite pricey and
not worth the price they were paying (for example, Spotify runs all of
their data processing workloads as batch). I guess Nemo could be a good
candidate for running batch workloads, as Spark often suffers from problems
such as large-scale shuffle and data skew when it is not given machines
with enough memory, whereas Nemo is able to provide optimizations for such
problems. People also asked whether Nemo supports Kubernetes, which is a
topic that we should definitely look into.

I also had many questions from engineers at *Seznam.cz* and *shopify.com*
<http://shopify.com>, who run their own datacenters to process their data
(I think). They have been facing exactly the same problems mentioned above
(large-scale shuffle, data skew, frequent reloading of broadcast data,
utilizing transient resources, etc.), and they had questions about running
their data processing workloads on the large amounts of data they handle
every day (up to 40 TB/day). I should definitely follow up with them to see
how they are doing and whether they are trying Nemo in production, to
provide help if needed, and to see Nemo's performance on real workloads.

Lastly, I have been talking with Pablo (from Beam) about the trip to
*Seattle* and Renton, Washington next week for the USENIX ATC '19
conference, and we chatted about organizing a lunch and maybe a small
talk with the Googlers there as well! I've also heard that Davor is
based in Seattle, so I think it would be a great opportunity for us to
meet in person. 😀 The date would probably be the *15th of July*, so please
keep the date in mind if you are interested!

Cheers,
Wonook

Re: Back from Berlin Beam Summit 2019

Posted by Davor Bonaci <da...@apache.org>.

Let's move the scheduling off-list.

On Tue, Jul 2, 2019 at 12:50 AM 송원욱 <ws...@gmail.com> wrote:

> Ooh, that’s going to be a bit tricky. We’re landing at SEA on the morning
> of the 9th and flying back early on the 16th. The conference runs from the
> 10th until the 12th. Are any other dates possible, or should we try
> pushing for the evening of the 15th, or should we give it another go
> next time?
>
> > On 2 Jul 2019, at 1:08 AM, Davor Bonaci <da...@apache.org> wrote:
> >
> > Would love to meet if you are in the area.
> >
> > Scheduling wise, I’ll be landing at SEA around 6 pm on the 15th, so 16th
> > and onwards would be better. Evening on the 15th can work, but it is
> > pushing it.
> >
> > Davor

Re: Back from Berlin Beam Summit 2019

Posted by 송원욱 <ws...@gmail.com>.

Ooh, that’s going to be a bit tricky. We’re landing at SEA on the morning of the 9th and flying back early on the 16th. The conference runs from the 10th until the 12th. Are any other dates possible, or should we try pushing for the evening of the 15th, or should we give it another go next time?

> On 2 Jul 2019, at 1:08 AM, Davor Bonaci <da...@apache.org> wrote:
> 
> Would love to meet if you are in the area.
> 
> Scheduling wise, I’ll be landing at SEA around 6 pm on the 15th, so 16th
> and onwards would be better. Evening on the 15th can work, but it is
> pushing it.
> 
> Davor

Re: Back from Berlin Beam Summit 2019

Posted by Davor Bonaci <da...@apache.org>.

Would love to meet if you are in the area.

Scheduling wise, I’ll be landing at SEA around 6 pm on the 15th, so 16th
and onwards would be better. Evening on the 15th can work, but it is
pushing it.

Davor

Re: Back from Berlin Beam Summit 2019

Posted by Gyewon Lee <gw...@apache.org>.

Good job, Wonook! I will be looking forward to hearing further news from
you.

Best,
Gyewon

On Tue, Jul 2, 2019 at 3:17 AM, Byung-Gon Chun <bg...@gmail.com> wrote:

> Thanks for sharing the news!
>
> Sounds very exciting. I’d be interested in following up with the companies
> that may use Nemo, too.
>
> Cheers,
> Gon

Re: Back from Berlin Beam Summit 2019

Posted by Byung-Gon Chun <bg...@gmail.com>.

Thanks for sharing the news!

Sounds very exciting. I’d be interested in following up with the companies that may use Nemo, too.

Cheers,
Gon
