Posted to dev@spark.apache.org by OBones <ob...@free.fr> on 2017/07/12 15:33:53 UTC

More efficient RDD.count() implementation

Hello,

As I have written my own data source, I also wrote a custom RDD[Row] 
implementation to provide getPartitions and compute overrides.
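
For context, the custom RDD has roughly the following shape (class and
member names below are simplified placeholders, not my actual code):

   import org.apache.spark.{Partition, SparkContext, TaskContext}
   import org.apache.spark.rdd.RDD
   import org.apache.spark.sql.Row

   // One partition per block exposed by the underlying reader
   case class BlockPartition(index: Int) extends Partition

   class MyDataSourceRDD(sc: SparkContext, reader: MyReader)
     extends RDD[Row](sc, Nil) {

     // The reader already knows how many blocks it exposes
     override def getPartitions: Array[Partition] =
       Array.tabulate[Partition](reader.numBlocks)(BlockPartition(_))

     // Stream the rows of a single block
     override def compute(split: Partition, context: TaskContext): Iterator[Row] =
       reader.rows(split.index)
   }
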
This works very well, but doing some performance analysis I see that,
for any given pipeline fit operation, a fair amount of time is spent
in the RDD.count method.
Its default implementation in RDD.scala walks the entire iterator of
every partition, which in my case is counterproductive because I
already know how many rows there are in the RDD and in each partition
returned by getPartitions.
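
For reference, that default implementation is essentially the
following (paraphrasing RDD.scala from Spark 2.x): it runs a job
whose only purpose is to walk each partition's iterator and add up
the lengths.

   // org.apache.spark.rdd.RDD, paraphrased
   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
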
As an initial attempt, I declared the following in my custom RDD 
implementation:

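   // the reader already knows the total number of rows, no scan needed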
   override def count(): Long = { reader.RowCount }

but this never gets called, which upon further inspection makes
perfect sense: the internal code creates its own RDDs from the data
it has to work on, so count() ends up being called on those derived
RDDs rather than on mine. And this is where I'm a bit stuck, because
I have no idea how to override this creation.
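
If I read the code correctly, Predictor.fit extracts an
RDD[LabeledPoint] from the input DataFrame, so the RDD that count()
ends up being called on is a MapPartitionsRDD several transformations
away from mine. With the placeholder names from above, the situation
is roughly:

   val mine = new MyDataSourceRDD(sc, reader) // count() would be cheap here
   val derived = mine.map(identity)           // MapPartitionsRDD wrapping mine
   derived.count()                            // default count: walks every row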

Here is a call stack for a GBTRegressor run, but it's quite similar for 
RandomForestRegressor or DecisionTreeRegressor.

org.apache.spark.rdd.RDD.count(RDD.scala:1158)
org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:116)
org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:125)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.boost(GradientBoostedTrees.scala:291)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.run(GradientBoostedTrees.scala:49)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:154)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:58)
org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
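
If it helps, the count at the top of that stack appears to be the
numExamples computation in DecisionTreeMetadata.buildMetadata,
roughly:

   // org.apache.spark.ml.tree.impl.DecisionTreeMetadata, paraphrased
   val numExamples = input.count()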

Any suggestion would be much appreciated.

Regards
