Posted to user@spark.apache.org by Manjunath Shetty H <ma...@live.com> on 2020/03/01 12:32:28 UTC
How to collect Spark dataframe write metrics
Hi all,
Basically my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?
If it is a good practice to follow, then how do I get the DataFrame-level write metrics?
Any pointers would be helpful.
Thanks and Regards
Manjunath
Re: How to collect Spark dataframe write metrics
Posted by Manjunath Shetty H <ma...@live.com>.
Thanks Zohar,
Will try that
-
Manjunath
________________________________
From: Zohar Stiro <zs...@gmail.com>
Sent: Tuesday, March 3, 2020 1:49 PM
To: Manjunath Shetty H <ma...@live.com>
Cc: user <us...@spark.apache.org>
Subject: Re: How to collect Spark dataframe write metrics
Re: How to collect Spark dataframe write metrics
Posted by Zohar Stiro <zs...@gmail.com>.
Hi,
to get DataFrame-level write metrics you can take a look at the following trait:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteStatsTracker.scala
and a basic implementation example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/BasicWriteStatsTracker.scala
and here is an example of how it is being used in FileStreamSink:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L178
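If all you need is a record count rather than a full custom stats tracker, a simpler route is Spark's public listener API: task-level output metrics already carry the number of records written. A rough sketch (the SparkSession, DataFrame and output path are assumptions, and the listener sums records across whatever jobs complete while it is registered):

```scala
import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums taskMetrics.outputMetrics.recordsWritten over all completed tasks.
class RecordsWrittenListener extends SparkListener {
  val recordsWritten = new LongAdder
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) recordsWritten.add(m.outputMetrics.recordsWritten)
  }
}

// Usage sketch (spark, df and the path are assumptions):
// val listener = new RecordsWrittenListener
// spark.sparkContext.addSparkListener(listener)
// df.write.parquet("/tmp/out")
// val written = listener.recordsWritten.sum()
// spark.sparkContext.removeSparkListener(listener)
```

Registering and removing the listener around the write keeps the sum scoped to that job, though any other job running concurrently on the same session would still be counted.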
- about the good practice: it depends on your use case, but generally speaking I would not do it - at least not for checking your logic or checking that Spark is working correctly.
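That said, if you do decide to validate, the most direct sketch is a count before the write compared against a count of the data read back (df, spark and outputPath are assumptions; note that both counts trigger full scans, so this roughly doubles the cost of the job):

```scala
// Count-based validation sketch: compare the pre-write count with a count
// of what was actually persisted. Assumes `df`, `spark` and `outputPath`
// already exist in scope.
val expected = df.count()
df.write.mode("overwrite").parquet(outputPath)
val actual = spark.read.parquet(outputPath).count()
require(expected == actual,
  s"Row count mismatch: expected $expected rows, read back $actual")
```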