Posted to user@spark.apache.org by Aniruddha P Tekade <at...@binghamton.edu> on 2020/02/26 19:16:35 UTC

Re: [External Email] Re: Standard practices for building dashboards for spark processed data

Hi Roland,

Thank you for your reply. That's quite helpful. I think I should try
InfluxDB then. But I am curious whether, in the case of Prometheus, writing
a custom exporter would be a good choice and solve the purpose efficiently.
Grafana is not something I want to drop.
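
For context, here is roughly the kind of exporter I have in mind: a minimal
sketch using the Python prometheus_client library, where the metric name,
the port, and the read_latest_metrics() helper are placeholders of mine
rather than anything from the actual pipeline:

from prometheus_client import Gauge, start_http_server
import time

# Hypothetical gauge holding the latest value produced by the pipeline
pipeline_output_value = Gauge(
    "pipeline_output_value",
    "Latest value produced by the streaming pipeline",
)

def read_latest_metrics():
    # Placeholder: read the latest aggregate that the Spark job wrote out,
    # e.g. from a small summary file.
    return 42.0

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<host>:8000/metrics
    while True:
        pipeline_output_value.set(read_latest_metrics())
        time.sleep(15)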

Best,
Aniruddha
-----------


On Tue, Feb 25, 2020 at 11:36 PM Roland Johann <ro...@phenetic.io>
wrote:

> Hi Ani,
>
> Prometheus is not well suited for ingesting explicit time-series data. Its
> purpose is technical monitoring. If you want to monitor your Spark jobs
> with Prometheus, you can publish the metrics so Prometheus can scrape them.
> What you are probably looking for is a time-series database that you can
> push metrics to.
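>
> As a rough illustration of the push model, a minimal sketch assuming the
> influxdb-client Python package; the URL, token, bucket, and measurement
> names are made up:
>
> from influxdb_client import InfluxDBClient, Point
> from influxdb_client.client.write_api import SYNCHRONOUS
>
> client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
> write_api = client.write_api(write_options=SYNCHRONOUS)
>
> # Each point carries its own timestamp and is pushed by the application,
> # instead of being scraped by the monitoring system.
> point = Point("pipeline_metrics").tag("job", "spark-streaming").field("value", 42.0)
> write_api.write(bucket="pipeline", record=point)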
>
> Looking for an alternative to Grafana should only be done if you find that
> Grafana is not well suited for your visualization use case.
>
> As said earlier, at a quick glance it sounds like you should look for an
> alternative to Prometheus.
>
> For time series you can look at TimescaleDB or InfluxDB. Other databases,
> such as ordinary SQL databases or Cassandra, lack up/downsampling
> capabilities, which can lead to large query responses and the need for the
> client to post-process.
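>
> To make the downsampling point concrete: with InfluxDB the aggregation can
> be pushed into the query itself, so the database returns, say, one averaged
> point per minute instead of every raw row. A sketch with the same
> hypothetical influxdb-client setup and made-up names as above:
>
> from influxdb_client import InfluxDBClient
>
> client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
> query_api = client.query_api()
>
> # aggregateWindow does the downsampling on the server side
> tables = query_api.query('''
> from(bucket: "pipeline")
>   |> range(start: -1h)
>   |> filter(fn: (r) => r._measurement == "pipeline_metrics")
>   |> aggregateWindow(every: 1m, fn: mean)
> ''')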
>
> Kind regards,
>
> Aniruddha P Tekade <at...@binghamton.edu> wrote on Wed, 26 Feb 2020 at
> 02:23:
>
>> Hello,
>>
>> I am trying to build a data pipeline that uses Spark Structured Streaming
>> with the Delta Lake project and runs on Kubernetes. Because of this, I get
>> my output files only in Parquet format. Since I am asked to use Prometheus
>> and Grafana for building the dashboard for this pipeline, I run another
>> small Spark job to convert the output into JSON so that I can insert it
>> into Grafana. Although I can see that this step is redundant, considering
>> the importance of the Delta Lake project I cannot write my data directly
>> as JSON. Therefore I need some help/guidelines/opinions about moving
>> forward from here.
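>>
>> To make it concrete, here is a minimal sketch of the kind of conversion
>> job I mean (the paths are placeholders, not my actual job):
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.appName("parquet-to-json").getOrCreate()
>>
>> # Read the Parquet output of the streaming job and rewrite it as JSON
>> df = spark.read.parquet("/data/pipeline/output/parquet")
>> df.write.mode("overwrite").json("/data/pipeline/output/json")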
>>
>> I would appreciate it if Spark users could share some practices to follow
>> with respect to the following questions:
>>
>>    1. Since I cannot get direct JSON output from Spark Structured
>>    Streaming, is there any better way to convert Parquet into JSON? Or
>>    should I keep only Parquet?
>>    2. Will I need to write a custom exporter for Prometheus so that
>>    Grafana can read this time-series data?
>>    3. Is there a better dashboard alternative to Grafana for this
>>    requirement?
>>    4. Since the pipeline is going to run on Kubernetes, I am trying to
>>    avoid InfluxDB as the time-series database and to go with Prometheus.
>>    Is this approach correct?
>>
>> Thanks,
>> Ani
>> -----------
>>
> --
> Roland Johann
> Software Developer/Data Engineer
>
> phenetic GmbH
> Lütticher Straße 10, 50674 Köln, Germany
>
> Mobile: +49 172 365 26 46
> Mail: roland.johann@phenetic.io
> Web: phenetic.io
>
> Commercial register: Amtsgericht Köln (HRB 92595)
> Managing directors: Roland Johann, Uwe Reimann
>