Posted to user@spark.apache.org by Hamish Whittal <ha...@cloud-fundis.co.za> on 2021/05/27 17:03:36 UTC
Accumulators and other important metrics for your job
Hi folks,
I have a problematic dataset I'm working with and am trying to find ways of
"debugging" the data.
For example, the simplest thing I'd like to do is know how many rows of
data I've read and compare that to a straight count of the lines in the
file.
I could do:
df.count()
but this seems clunky (and expensive) for something that should be easy to
keep track of. I then thought accumulators might be the solution, but it
seems I would need at least a second pass through the data just to
"addInPlace" to the running line total - at which point I might as well
just do the count.
I would also expect that if I hit a row without the relevant data, I should
be able to tally that too. Say, a record without the requisite primary key.
I note too that accumulators only hold tallies, but what if I want to keep
track of every file read? Say my directory has 100k files or some such; I
want to confirm that each file was read, by filename. Accumulators won't
help me there, since I want the filenames rather than just a count of files
read. That way I could, for example, work out that file X was missed
because it was corrupt.
Has anyone got some advice on handling this sort of stuff?
Thanks in advance.