
Posted to dev@flink.apache.org by Greg Hogan <co...@greghogan.com> on 2015/10/20 23:06:07 UTC

Scaling Flink

Is there guidance for configuring Flink on large clusters? I have recently
been working to benchmark and test some algorithms on AWS. I had no issues
running on a 16-node cluster, but when moving to 64 nodes the JobManager
struggled mightily. It did not appear to be parallelizing its workload. I was
in the process of modifying my code to reduce the parallelism of earlier,
smaller operations when I lost the cluster due to a spot price increase.

The instances were c3.8xlarge and in the larger cluster one instance hosted
the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
parallelism of 512.
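
The parallelism figures above can be reproduced with a short sketch. The
helper name and its parameters are hypothetical and not part of Flink. It
assumes one task slot per vCPU (a c3.8xlarge has 32 vCPUs) and that the 512
figure for the small cluster corresponds to 16 worker instances, which the
numbers suggest but the post does not state outright.

```python
def total_parallelism(nodes, slots_per_node, jobmanager_only_nodes=0):
    """Return the total number of task slots across the worker nodes.

    Hypothetical helper for illustration only; not a Flink API.
    """
    workers = nodes - jobmanager_only_nodes
    if workers <= 0:
        raise ValueError("need at least one worker node")
    return workers * slots_per_node


# Assumption: one slot per vCPU on a c3.8xlarge (32 vCPUs).
SLOTS_PER_NODE = 32

# Large cluster: 64 instances, one of them reserved for the JobManager.
large = total_parallelism(64, SLOTS_PER_NODE, jobmanager_only_nodes=1)

# Small cluster: 16 worker instances (assumed from the 512 figure).
small = total_parallelism(16, SLOTS_PER_NODE)

print(large, small)  # 2016 512
```

The jump from 512 to 2016 is roughly a fourfold increase in the number of
parallel subtasks per operator that the single JobManager has to deploy and
track.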

I have seen the blog posts describing the performance of 640-core clusters
on GCE. Is this a known limitation, or can Flink scale much further?

http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/

http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/

Thanks,
Greg

Re: Scaling Flink

Posted by Stephan Ewen <se...@apache.org>.

@Greg: Can you describe at which points the JobManager struggled? I
would guess that it happens at some point during deployment, and that
deployment takes longer than you expected?

On Wed, Oct 21, 2015 at 10:14 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Greg,
>
> It would be very interesting to do a profiling of the job master to
> see what it mostly spends time on. Did you run your experiments with
> 0.9.X or the 0.10-SNAPSHOT? Would be interesting to know if there is a
> regression.
>
> Best,
> Max
>
> On Wed, Oct 21, 2015 at 10:08 AM, Till Rohrmann <tr...@apache.org> wrote:
> > Hi Greg,
> >
> > there is no official guide for running Flink on large clusters. As far as I
> > know, the cluster we used for the matrix factorization was the largest
> > cluster we've run a serious job on. Thus, it would be highly interesting to
> > understand what made the JobManager to slow down. At some point, though,
> > this should happen since the JobManager always stays a single instance. Do
> > you have by chance access to the JobManager log file? This might be helpful.
> >
> > Thanks for your help,
> > Till
> >
> > On Tue, Oct 20, 2015 at 11:06 PM, Greg Hogan <co...@greghogan.com> wrote:
> >
> >> Is there guidance for configuring Flink on large clusters? I have recently
> >> been working to benchmark some algorithms on and test AWS. I had no issues
> >> running on a 16 node cluster but when moving to 64 nodes the JobManager
> >> struggled mightily. It did not look to be parallelizing its workload. I was
> >> in the process of modifying my code to reduce the parallelism of earlier,
> >> smaller operations when I lost the cluster due to a spot price increase.
> >>
> >> The instances were c3.8xlarge and in the larger cluster one instance hosted
> >> the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
> >> parallelism of 512.
> >>
> >> I have seen the blog posts describing the performance of 640 core clusters
> >> on GCE. Is this a known limitation or can Flink scale much further?
> >>
> >>
> >> http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
> >>
> >>
> >> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
> >>
> >> Thanks,
> >> Greg
> >>
>

Re: Scaling Flink

Posted by Maximilian Michels <mx...@apache.org>.

Hi Greg,

It would be very interesting to profile the JobManager to see where it
spends most of its time. Did you run your experiments with 0.9.X or the
0.10-SNAPSHOT? It would be interesting to know whether there is a
regression.

Best,
Max

On Wed, Oct 21, 2015 at 10:08 AM, Till Rohrmann <tr...@apache.org> wrote:
> Hi Greg,
>
> there is no official guide for running Flink on large clusters. As far as I
> know, the cluster we used for the matrix factorization was the largest
> cluster we've run a serious job on. Thus, it would be highly interesting to
> understand what made the JobManager to slow down. At some point, though,
> this should happen since the JobManager always stays a single instance. Do
> you have by chance access to the JobManager log file? This might be helpful.
>
> Thanks for your help,
> Till
>
> On Tue, Oct 20, 2015 at 11:06 PM, Greg Hogan <co...@greghogan.com> wrote:
>
>> Is there guidance for configuring Flink on large clusters? I have recently
>> been working to benchmark some algorithms on and test AWS. I had no issues
>> running on a 16 node cluster but when moving to 64 nodes the JobManager
>> struggled mightily. It did not look to be parallelizing its workload. I was
>> in the process of modifying my code to reduce the parallelism of earlier,
>> smaller operations when I lost the cluster due to a spot price increase.
>>
>> The instances were c3.8xlarge and in the larger cluster one instance hosted
>> the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
>> parallelism of 512.
>>
>> I have seen the blog posts describing the performance of 640 core clusters
>> on GCE. Is this a known limitation or can Flink scale much further?
>>
>>
>> http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
>>
>>
>> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>>
>> Thanks,
>> Greg
>>

Re: Scaling Flink

Posted by Till Rohrmann <tr...@apache.org>.

Hi Greg,

There is no official guide for running Flink on large clusters. As far as I
know, the cluster we used for the matrix factorization was the largest
cluster we've run a serious job on. Thus, it would be highly interesting to
understand what made the JobManager slow down. At some point, though, a
slowdown is to be expected, since the JobManager always remains a single
instance. Do you by chance have access to the JobManager log file? It might
be helpful.

Thanks for your help,
Till

On Tue, Oct 20, 2015 at 11:06 PM, Greg Hogan <co...@greghogan.com> wrote:

> Is there guidance for configuring Flink on large clusters? I have recently
> been working to benchmark some algorithms on and test AWS. I had no issues
> running on a 16 node cluster but when moving to 64 nodes the JobManager
> struggled mightily. It did not look to be parallelizing its workload. I was
> in the process of modifying my code to reduce the parallelism of earlier,
> smaller operations when I lost the cluster due to a spot price increase.
>
> The instances were c3.8xlarge and in the larger cluster one instance hosted
> the JobManager so the parallelism was 63 * 32 = 2016. The small cluster had
> parallelism of 512.
>
> I have seen the blog posts describing the performance of 640 core clusters
> on GCE. Is this a known limitation or can Flink scale much further?
>
>
> http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
>
>
> http://data-artisans.com/how-to-factorize-a-700-gb-matrix-with-apache-flink/
>
> Thanks,
> Greg
>