Posted to user@spark.apache.org by Manjunath Shetty H <ma...@live.com> on 2020/03/01 12:32:28 UTC

How to collect Spark dataframe write metrics

Hi all,

Basically, my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?

If it is a good practice to follow, then how do I get the DataFrame-level write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath

Re: How to collect Spark dataframe write metrics

Posted by Manjunath Shetty H <ma...@live.com>.
Thanks Zohar,

Will try that


-
Manjunath

Re: How to collect Spark dataframe write metrics

Posted by Zohar Stiro <zs...@gmail.com>.
Hi,

To get DataFrame-level write metrics, you can take a look at the following trait:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala
and a basic implementation example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala


And here is an example of how it is used in FileStreamSink:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L178
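
To make that concrete, here is a minimal sketch of a row-counting tracker, modelled on BasicWriteStatsTracker. Two caveats: these traits live in org.apache.spark.sql.execution.datasources, which is internal API, and the method signatures below follow the Spark 2.4-era source and have changed in later versions, so check WriteStatsTracker.scala in your own Spark build first. All of the RowCount* names are illustrative:

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.datasources.{WriteJobStatsTracker, WriteTaskStats, WriteTaskStatsTracker}

// Stats object shipped from each write task back to the driver.
case class RowCountStats(numRows: Long) extends WriteTaskStats

// Executor side: one instance per write task; sees every row as it is written.
class RowCountTaskStatsTracker extends WriteTaskStatsTracker {
  private var numRows = 0L
  override def newPartition(partitionValues: InternalRow): Unit = {}
  override def newBucket(bucketId: Int): Unit = {}
  override def newFile(filePath: String): Unit = {}
  override def newRow(row: InternalRow): Unit = numRows += 1
  override def getFinalStats(): WriteTaskStats = RowCountStats(numRows)
}

// Driver side: hands out per-task trackers and aggregates their stats.
class RowCountJobStatsTracker extends WriteJobStatsTracker {
  override def newTaskInstance(): WriteTaskStatsTracker = new RowCountTaskStatsTracker
  override def processStats(stats: Seq[WriteTaskStats]): Unit = {
    val total = stats.collect { case RowCountStats(n) => n }.sum
    println(s"Rows written: $total") // compare against the expected count here
  }
}

The awkward part is wiring it in: df.write exposes no hook for custom trackers, so you would have to go through the internal FileFormatWriter.write(..., statsTrackers = ...) path the way FileStreamSink does. Alternatively, BasicWriteJobStatsTracker already publishes a numOutputRows SQL metric for writes, which shows up in the SQL tab of the UI and should also be reachable from a QueryExecutionListener.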

About the good practice: it depends on your use case, but generally speaking I would not do it, at least not for checking your logic or checking that Spark is working correctly. If the write completes without an exception, Spark's commit protocol is meant to guarantee that the committed output is complete.
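
If you do want the belt-and-braces check from the original post, it needs no internal APIs at all. A rough sketch, assuming a SparkSession named spark, a DataFrame df, and an illustrative output path:

// Cache first, otherwise count() and write() each recompute the lineage.
df.cache()
val expected = df.count()
df.write.mode("overwrite").parquet("hdfs:///tmp/out")

// Re-read what actually landed on HDFS and compare.
val actual = spark.read.parquet("hdfs:///tmp/out").count()
require(expected == actual, s"row count mismatch: expected $expected, actual $actual")

The price is a full extra scan of the output, which is why this is usually reserved for pipelines where silent data loss would be expensive to discover later.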
