Posted to user@spark.apache.org by Hamish Whittal <ha...@cloud-fundis.co.za> on 2021/05/27 17:03:36 UTC

Accumulators and other important metrics for your job

Hi folks,

I have a problematic dataset I'm working with and am trying to find ways of
"debugging" the data.

For example, the simplest thing I would like to do is know how many rows of
data I've read and compare that to a simple count of the lines in the file.

I could do:
   df.count()

but this seems clunky (and expensive) for something that should be easy to
keep track of. I then thought accumulators might be the answer, but it seems
I would have to make at least a second pass through the data just to
addInPlace to the running line total, at which point I might as well just do
the count anyway.

I would also expect that if I hit a row without the relevant data, I should
be able to tally that too. Say, a record without the requisite primary key.
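For that I was picturing a second accumulator, updated from foreach so the
update happens inside an action and is applied exactly once per row. Again
only a sketch; the column name "id" and the path are hypothetical:

   from pyspark.sql import SparkSession

   spark = SparkSession.builder.getOrCreate()
   missing_key = spark.sparkContext.accumulator(0)

   df = spark.read.csv("data/input.csv", header=True)

   def tally_missing(row):
       if row["id"] is None:     # "id" stands in for the primary key column
           missing_key.add(1)

   df.foreach(tally_missing)
   print(missing_key.value)      # rows without the requisite key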

I note too that accumulators are only tallies, but what if I want to keep
track of every file read? Say my directory has 100k files or some such: I
want to know, by filename, that each file was actually read. Accumulators
won't help me there, since I want to keep the filenames rather than just a
count of files read. That way I could, for example, work out that file X was
missed because it was corrupt.
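For what it's worth, the nearest workaround I could think of is tagging each
row with input_file_name() and collecting the distinct values, then diffing
that against a directory listing, but I'm not sure that scales. Sketch only;
the directory path is made up:

   from pyspark.sql import SparkSession
   from pyspark.sql.functions import input_file_name

   spark = SparkSession.builder.getOrCreate()

   df = (spark.read.csv("data/incoming/", header=True)
              .withColumn("source_file", input_file_name()))

   files_seen = {r["source_file"]
                 for r in df.select("source_file").distinct().collect()}
   # comparing files_seen with a listing of the directory would show
   # which files were never read (e.g. because they were corrupt)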

Has anyone got some advice on handling this sort of stuff?

Thanks in advance.