Posted to dev@reef.apache.org by Douglas Service <ds...@gmail.com> on 2017/10/03 01:47:06 UTC

Maximum REEF scaling

I have heard that REEF is unable to scale past 1024 nodes on a cluster. Is
this a true statement and if so do we know what the limiting factor is that
prevents further scaling?

Doug

RE: Maximum REEF scaling

Posted by "Julia Wang (QIUHE)" <Qi...@microsoft.com.INVALID>.

BTW, without removing the synchronization in the bridge driver handlers, yes, it doesn't scale. It can still pass 1024, but it cannot go much higher.

-Julia

-----Original Message-----
From: Julia Wang (QIUHE) 
Sent: Monday, October 2, 2017 7:16 PM
To: dev@reef.apache.org
Subject: RE: Maximum REEF scaling

REEF can definitely run 1024 containers, even with only 512 MB and 1 core for the driver.

To support more containers, we need proper driver settings from the client. We also have time constraints on how quickly we need to start all the evaluators before the timeout. Otherwise, we could run as many containers as YARN can give us.

There are some locks in the bridge code that impact parallel processing. I made some fixes locally and was able to run 5500 containers from .NET with the current YARN timeout settings.
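
As a rough illustration of why a shared lock in the handlers limits parallel processing, here is a minimal Java sketch. It is not REEF or bridge code: the class HandlerLockSketch, the BRIDGE_LOCK object, the run method, and the 5 ms sleep standing in for per-event work are all assumptions made for the example. It counts how many handlers are ever running at the same time, with and without one coarse lock around every handler.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HandlerLockSketch {
    // Hypothetical stand-in for a single lock shared by every handler.
    private static final Object BRIDGE_LOCK = new Object();

    /** Outcome of one run: events handled and the peak number of handlers running at once. */
    public static final class Result {
        public final int handled;
        public final int peakConcurrency;

        Result(int handled, int peakConcurrency) {
            this.handled = handled;
            this.peakConcurrency = peakConcurrency;
        }
    }

    /** Dispatches `events` simulated events on `threads` workers, optionally under one coarse lock. */
    public static Result run(int events, int threads, boolean coarseLock) {
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger handled = new AtomicInteger();

        Runnable handler = () -> {
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // remember the highest overlap seen
            try {
                Thread.sleep(5); // simulated per-event work (assumption)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                inFlight.decrementAndGet();
                handled.incrementAndGet();
            }
        };

        // With the coarse lock, every handler body runs inside the same monitor.
        Runnable task = coarseLock
                ? () -> { synchronized (BRIDGE_LOCK) { handler.run(); } }
                : handler;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < events; i++) {
            pool.execute(task);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Result(handled.get(), peak.get());
    }
}
```

Calling HandlerLockSketch.run(8, 4, true) always reports a peak concurrency of 1, because the lock lets only one handler run at a time regardless of the thread count. The same call with false lets handlers overlap up to the pool size, so the total time no longer grows linearly with the number of events.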

However, with the bridge code, we are slow when the number of containers is large. We need to find the root cause and make improvements.

- Julia

