Posted to dev@spark.apache.org by Yiannis Gkoufas <jo...@gmail.com> on 2016/02/03 16:26:56 UTC

SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

Hi all,

I just wanted to introduce some of my recent work at IBM Research around
Spark, specifically its metrics system and Web UI.
As a quick overview of our contributions:
- We have created a new type of metrics Sink (HDFSSink) which persists the
metrics to HDFS.
- We have extended the metrics reported by the executors to include OS-level
metrics on CPU, RAM, disk I/O and network I/O, using the Hyperic Sigar
library (a sketch of such a source follows this list).
- We have extended the Web UI for completed applications to visualize
whichever of the above metrics the user chooses.
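To give a flavour of the OS-level piece, here is a minimal sketch, not the
SparkOscope code: Spark's metrics are Codahale/Dropwizard gauges grouped
into a Source, and Sigar can back such gauges with host-level readings. The
class and metric names below are illustrative, and Spark's own Source trait
is package-private, so a real implementation would live under
org.apache.spark.metrics.source:

    import com.codahale.metrics.{Gauge, MetricRegistry}
    import org.hyperic.sigar.Sigar

    // Illustration only: host-level gauges backed by Hyperic Sigar.
    // Requires the Sigar native library on each executor node.
    class SigarSource {
      val sourceName: String = "sigar"
      val metricRegistry = new MetricRegistry()

      private val sigar = new Sigar()

      // Combined CPU utilization across all cores, in [0.0, 1.0].
      metricRegistry.register("cpu.combined", new Gauge[Double] {
        override def getValue: Double = sigar.getCpuPerc.getCombined
      })
      // Percentage of physical memory in use.
      metricRegistry.register("mem.usedPercent", new Gauge[Double] {
        override def getValue: Double = sigar.getMem.getUsedPercent
      })
      // Disk and network gauges would follow the same pattern, e.g. via
      // sigar.getFileSystemUsage(...) and sigar.getNetInterfaceStat(...).
    }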
These features are configured through the metrics.properties and
spark-defaults.conf files; an illustrative configuration follows below.
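Roughly, the wiring might look like this. The HDFSSink class path and its
options are assumptions rather than verified keys (please check the repo's
README for the exact names); the source line uses Spark's standard
instance.source.name.class syntax from metrics.properties:

    # metrics.properties -- the sink keys below are illustrative
    # assumptions, not verified against the SparkOscope code.
    *.sink.hdfs.class=org.apache.spark.metrics.sink.HDFSSink
    *.sink.hdfs.dir=hdfs:///var/log/spark-metrics
    executor.source.sigar.class=org.apache.spark.metrics.source.SigarSource
    # The UI-side settings go in spark-defaults.conf (keys not shown here).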
We have recorded a small demo showing these capabilities, which you can
find here: https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
There is a blog post which gives more details on the functionality here:
http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/
and also there is a public repo where anyone can try it:
https://github.com/ibm-research-ireland/sparkoscope

I would really appreciate any feedback or advice regarding this work,
especially if you think it's worth upstreaming to the official Spark
repository.

Thanks a lot!

TakeOrderedAndProject operator may cause an OOM

Posted by 汪洋 <ti...@icloud.com>.
Hi,

Currently the TakeOrderedAndProject operator in Spark SQL uses RDD's takeOrdered method. When we pass a large limit to the operator, however, it brings up to partitionNum * limit records back to the driver, which may cause an OOM.
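For context, a simplified sketch of the takeOrdered pattern (the real
implementation uses a BoundedPriorityQueue, but the driver-side behaviour
is the same): each partition sends back its own top-`num` elements, and all
of them are merged on the driver.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Simplified illustration, not the actual implementation: every
    // partition ships up to `num` elements to the driver before the
    // final merge, so in the worst case the driver materializes on the
    // order of numPartitions * num records.
    def takeOrderedSketch[T: Ordering : ClassTag](rdd: RDD[T],
                                                  num: Int): Array[T] = {
      val perPartition: RDD[Array[T]] = rdd.mapPartitions { iter =>
        Iterator.single(iter.toArray.sorted.take(num)) // per-partition top-num
      }
      val onDriver = perPartition.collect() // up to numPartitions arrays of num
      onDriver.flatten.sorted.take(num)     // final merge on the driver
    }

With, say, 1000 partitions and a limit of 10,000,000, the driver could have
to hold on the order of 10 billion rows before the final take.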

Are there any plans in the community to address this problem?


Thanks.


Yang

Re: SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

Posted by Stavros Kontopoulos <st...@typesafe.com>.
Cool work! I will have a look at the project.

Cheers

On Fri, Feb 5, 2016 at 11:09 AM, Pete Robbins <ro...@gmail.com> wrote:

> Yiannis,
>
> I'm interested in what you've done here, as I was looking for ways to allow
> the Spark UI to display custom metrics in a pluggable way without having to
> modify the Spark source code. It would be good to see if we could modify
> your code to add extension points into the UI, so we could configure the
> sources of the additional metrics. For instance, rather than creating
> events from your HDFS files, I would like to have a module that pulls in
> system/JVM metrics that live in, e.g., Elasticsearch.
>
> Do any of the Spark committers have any thoughts on this?
>
> Cheers,
>
>
> On 3 February 2016 at 15:26, Yiannis Gkoufas <jo...@gmail.com> wrote:
>
>> Hi all,
>>
>> I just wanted to introduce some of my recent work at IBM Research around
>> Spark, specifically its metrics system and Web UI.
>> As a quick overview of our contributions:
>> - We have created a new type of metrics Sink (HDFSSink) which persists
>> the metrics to HDFS.
>> - We have extended the metrics reported by the executors to include
>> OS-level metrics on CPU, RAM, disk I/O and network I/O, using the
>> Hyperic Sigar library.
>> - We have extended the Web UI for completed applications to visualize
>> whichever of the above metrics the user chooses.
>> These features are configured through the metrics.properties and
>> spark-defaults.conf files.
>> We have recorded a small demo showing these capabilities, which you can
>> find here: https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
>> There is a blog post which gives more details on the functionality here:
>> http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/
>> and also there is a public repo where anyone can try it:
>> https://github.com/ibm-research-ireland/sparkoscope
>>
>> I would really appreciate any feedback or advice regarding this work,
>> especially if you think it's worth upstreaming to the official Spark
>> repository.
>>
>> Thanks a lot!
>>
>>
>
>



Re: SparkOscope: Enabling Spark Optimization through Cross-stack Monitoring and Visualization

Posted by Pete Robbins <ro...@gmail.com>.
Yiannis,

I'm interested in what you've done here, as I was looking for ways to allow
the Spark UI to display custom metrics in a pluggable way without having to
modify the Spark source code. It would be good to see if we could modify
your code to add extension points into the UI, so we could configure the
sources of the additional metrics. For instance, rather than creating
events from your HDFS files, I would like to have a module that pulls in
system/JVM metrics that live in, e.g., Elasticsearch. A rough sketch of the
kind of extension point I have in mind follows.
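A minimal sketch, with entirely hypothetical names (nothing below exists in
Spark or SparkOscope): the UI would be handed a provider implementation,
and each implementation decides where the time series come from, whether
HDFSSink files, Elasticsearch, or anything else.

    // Hypothetical extension point -- every name here is invented for
    // illustration; nothing below exists in Spark or SparkOscope.
    trait ExecutorMetricsProvider {
      /** Metric names this provider can serve for an application. */
      def availableMetrics(appId: String): Seq[String]

      /** (timestampMillis, value) samples for one executor and metric. */
      def fetch(appId: String, executorId: String,
                metric: String): Seq[(Long, Double)]
    }

    // A trivial in-memory implementation, standing in for ones backed by
    // HDFSSink output or an Elasticsearch query. The UI would choose the
    // implementation class via some config key and instantiate it
    // reflectively, much like the metrics system loads its sinks.
    class InMemoryMetricsProvider(
        data: Map[(String, String, String), Seq[(Long, Double)]])
      extends ExecutorMetricsProvider {

      def availableMetrics(appId: String): Seq[String] =
        data.keys.collect { case (`appId`, _, metric) => metric }.toSeq.distinct

      def fetch(appId: String, executorId: String,
                metric: String): Seq[(Long, Double)] =
        data.getOrElse((appId, executorId, metric), Seq.empty)
    }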

Do any of the Spark committers have any thoughts on this?

Cheers,


On 3 February 2016 at 15:26, Yiannis Gkoufas <jo...@gmail.com> wrote:

> Hi all,
>
> I just wanted to introduce some of my recent work at IBM Research around
> Spark, specifically its metrics system and Web UI.
> As a quick overview of our contributions:
> - We have created a new type of metrics Sink (HDFSSink) which persists
> the metrics to HDFS.
> - We have extended the metrics reported by the executors to include
> OS-level metrics on CPU, RAM, disk I/O and network I/O, using the
> Hyperic Sigar library.
> - We have extended the Web UI for completed applications to visualize
> whichever of the above metrics the user chooses.
> These features are configured through the metrics.properties and
> spark-defaults.conf files.
> We have recorded a small demo showing these capabilities, which you can
> find here: https://ibm.app.box.com/s/vyaedlyb444a4zna1215c7puhxliqxdg
> There is a blog post which gives more details on the functionality here:
> http://www.spark.tc/sparkoscope-enabling-spark-optimization-through-cross-stack-monitoring-and-visualization-2/
> and also there is a public repo where anyone can try it:
> https://github.com/ibm-research-ireland/sparkoscope
>
> I would really appreciate any feedback or advice regarding this work,
> especially if you think it's worth upstreaming to the official Spark
> repository.
>
> Thanks a lot!
>