Posted to dev@spark.apache.org by Сергей Лихоман <se...@gmail.com> on 2015/08/29 19:52:16 UTC

Research of Spark scalability / performance issues

Hi guys!

I am going to contribute to Spark, but I don't have much experience using it
under high load, so I would greatly appreciate any help pointing out
scalability or performance issues that could be researched and resolved.

I have several ideas:
1. Node HA (this seems to be solved in Spark already, but maybe someone knows
of existing problems).
2. Improve data distribution between nodes: analyze queries and automatically
suggest a data distribution that improves performance (see the sketch after
this list).
3. Geo-distribution, but is that a real need?
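
To make idea 2 concrete, here is a minimal Scala sketch of the manual,
key-based partitioning such a tool would suggest automatically; the input
path, the key field, and the partition count are all assumptions for
illustration:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitioningSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partitioning-sketch"))

    // Hypothetical CSV of orders keyed by customer id (first field).
    val orders = sc.textFile("hdfs:///data/orders")
      .map(line => (line.split(",")(0), line))

    // Today a human picks the partitioner from knowledge of the workload;
    // idea 2 would analyze the queries and suggest this step automatically.
    val byCustomer = orders.partitionBy(new HashPartitioner(128)).cache()

    // Later joins or aggregations on the same key can avoid a full shuffle.
    println(byCustomer.count())
    sc.stop()
  }
}

The trade-off an automated advisor would have to weigh is the cost of the
initial repartition against the shuffles saved by later co-partitioned work.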

This will be my master's degree project, so please help me select the right
improvement.

Thanks in advance!

Re: Research of Spark scalability / performance issues

Posted by Сергей Лихоман <se...@gmail.com>.
I will focus on scheduling to improve throughput. I need some time to read
the JIRA tickets and analyze the requirements for a future scheduler. I will
update this thread with a doc when that is done.

Thanks guys!


Re: Research of Spark scalability / performance issues

Posted by Steve Loughran <st...@hortonworks.com>.
If you look at the recurrent issues in datacentre-scale computing systems, two stand out:
- resilience to failure: that's algorithms and the layers underneath (storage, work allocation & tracking, ...)
- scheduling: maximising resource utilisation while prioritising high-SLA work (interactive things, HBase) on a mixed-workload cluster

Scheduling is where you get to use phrases like "APX-hard" in papers attached to JIRAs and not only not scare people off, you may actually get feedback. In large clusters (and we are talking 10K+ node clusters these days), fractional improvements in cluster utilisation are big enough to show up in the spreadsheets of the people who sign off on cluster purchases. But for the same reason, the maintainers of the big schedulers (e.g. the YARN ones) are usually pretty reluctant to accept patches. Focusing on scheduling within an app is likely to be more tractable.
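
For a sense of what "scheduling within an app" already looks like in Spark,
here is a minimal sketch using the existing fair scheduler pools; the pool
name is an assumption, and per-pool weights would normally come from a
fairscheduler.xml allocation file:

import org.apache.spark.{SparkConf, SparkContext}

object FairPoolsSketch {
  def main(args: Array[String]): Unit = {
    // FAIR mode shares the app's resources across concurrent jobs
    // instead of running them strictly FIFO.
    val conf = new SparkConf()
      .setAppName("fair-pools-sketch")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // "high-sla" is an assumed pool name for latency-sensitive work.
    sc.setLocalProperty("spark.scheduler.pool", "high-sla")
    sc.parallelize(1 to 1000, 10).count() // runs in the high-sla pool

    sc.setLocalProperty("spark.scheduler.pool", null) // back to the default pool
    sc.stop()
  }
}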


HA is one of those really-hard-to-get-right problems. If you like your Lamport papers and enjoy the TLA+ toolchain, it's the one to go for. The best tactic here is to start with the work of others, which outside of Google means ZooKeeper; understanding its proof would be a good start into that work.

Other topics:
- work optimisation (e.g. partitioning, placement, ordering of operations)
- self-tuning & cluster operations optimisation: can logs & live monitoring be used to improve system efficiency? The fact that logs are themselves large datasets means you get to use the analysis layer to introspect on past work (see the sketch below). Cluster ops isn't usually viewed as a CS problem, more as one of those implementation details.
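
On the logs-as-datasets point, a minimal sketch, assuming event logging is
enabled (spark.eventLog.enabled=true) and an assumed log directory; Spark
writes its event logs as JSON lines, so Spark can read its own history:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LogIntrospectionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-introspection"))
    val sqlContext = new SQLContext(sc)

    // Event logs are one JSON object per line; the path is an assumption
    // for wherever the cluster writes them.
    val events = sqlContext.read.json("hdfs:///spark-events/*")

    // Break down past activity by event type (job starts, task ends, ...);
    // "Event" is the type field present in every event-log record.
    events.groupBy("Event").count().show()

    sc.stop()
  }
}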

Finally, there is something that isn't in the software stack itself but would be used to test everything, and again uses the tooling: radically improving how we test, and how we understand the test results.
http://steveloughran.blogspot.co.uk/2015/05/distributed-system-testing-where-now.html

A challenge there would actually be getting your supervisors to recognise the problem and accept that it's worth you putting in the effort. Fault injection & failure simulation is something to consider here; it hooks up to HA nicely. Look at the Jepsen work as an example: https://aphyr.com/posts

-Steve



Re: Research of Spark scalability / performance issues

Posted by Reynold Xin <rx...@databricks.com>.
Both 2 and 3 are pretty good topics for a master's project, I think.

You can also look into how one can improve Spark's scheduler throughput.
A couple of years ago Kay measured it, but things have changed. It would be
great to start with measurement, then look at where the bottlenecks are, and
see how we can improve them.
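
As a concrete starting point for that measurement, here is a minimal Scala
sketch that times how quickly the driver can push near-empty tasks through
the scheduler; the task count and the single warm-up run are assumptions,
and a real study would control for much more:

import org.apache.spark.{SparkConf, SparkContext}

object SchedulerThroughputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("scheduler-throughput"))
    val numTasks = 10000 // assumed: large enough that per-task overhead dominates

    // Warm-up run so executor registration and JIT don't skew the timing.
    sc.parallelize(1 to numTasks, numTasks).count()

    val start = System.nanoTime()
    sc.parallelize(1 to numTasks, numTasks).count() // near-empty tasks
    val seconds = (System.nanoTime() - start) / 1e9

    println(f"~${numTasks / seconds}%.0f tasks/second through the scheduler")
    sc.stop()
  }
}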


On Sat, Aug 29, 2015 at 10:52 AM, Сергей Лихоман <se...@gmail.com>
wrote:

> Hi guys!
>
> I am going to make a contribution to Spark, but I didn't have much
> experience using it under high load and will be very appreciated for any
> help for pointing out scalability or performance issues that can be
> researched and resolved.
>
> I have several ideas:
> 1. Nodes HA (Seems like this is resolved in spark, but maybe someone knows
> existing problems..)
> 2. Improve data distribution between nodes. (analyze queries and
> automatically suggest data distribution to improve performance)
> 3. To think about Geo distribution. but is it actual?
>
> It will be master degree project. please, help me to select right
> improvement.
>
> Thanks in advance!
>