Posted to user@spark.apache.org by Nirav Patel <np...@xactlycorp.com> on 2016/05/13 19:04:24 UTC

API to study key cardinality, distribution, and other important statistics about data at a certain stage

Hi,

The problem is that every time a job fails or performs poorly at a certain stage, you
need to study your data distribution just before THAT stage. An overall look
at the input data set doesn't help very much when there are so many transformations
going on in the DAG. I always end up writing complicated typed code, separate
from the actual job, to run this analysis. Shouldn't there be a Spark API to
examine this in a better way? After all, Spark does go through all the records
(in most cases) to perform a transformation or action, so as a side job it could
gather statistics as well when instructed.
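
To make the idea concrete, here is a minimal sketch (in Scala) of the kind of
side analysis I mean, assuming the input to the suspect stage is a pair RDD.
The name keyStats is just illustrative, not an existing Spark API:

import org.apache.spark.rdd.RDD

// Illustrative helper: summarize the key distribution of whatever
// pair RDD feeds the suspect shuffle stage.
def keyStats[K, V](keyed: RDD[(K, V)]): Unit = {
  // Approximate distinct-key count (HyperLogLog-based, cheap).
  val approxKeys = keyed.keys.countApproxDistinct(relativeSD = 0.05)

  // Exact per-key record counts; for very high key cardinality,
  // consider sample()-ing the RDD first to keep this affordable.
  val counts = keyed.mapValues(_ => 1L).reduceByKey(_ + _).values.cache()
  val numKeys = counts.count()
  val maxPerKey = counts.max()
  val meanPerKey = counts.mean()

  println(s"approx distinct keys: $approxKeys")
  println(s"exact keys: $numKeys, max/key: $maxPerKey, mean/key: $meanPerKey")
  println(s"skew factor (max/mean): ${maxPerKey / meanPerKey}")
  counts.unpersist()
}

Calling something like this right before the problematic reduceByKey or join
makes heavy keys obvious, but it costs an extra pass over the data, which is
exactly why first-class support for gathering these statistics as a side
effect of the job itself would help.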

Thanks
