Posted to dev@spark.apache.org by OBones <ob...@free.fr> on 2017/07/12 15:33:53 UTC
More efficient RDD.count() implementation
Hello,
As I have written my own data source, I also wrote a custom RDD[Row]
implementation to provide getPartitions and compute overrides.
This works very well, but while doing some performance analysis I saw
that, for any given pipeline fit operation, a fair amount of time is
spent in the RDD.count method.
Its default implementation in RDD.scala is to go through the entire
iterator, which in my case is counterproductive because I already know
the number of rows in the RDD and in any partition returned by
getPartitions.
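To make the cost concrete, here is a minimal sketch of what the stock count amounts to, assuming a local Spark 2.x setup. The names CountSketch and countByIterating are hypothetical and used only for illustration; the real method relies on an internal Spark helper rather than this hand-written loop.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CountSketch {
  // Roughly what the stock RDD.count() does: run one task per partition,
  // walk that partition's iterator to the end, then sum the partition sizes.
  def countByIterating[T](rdd: RDD[T]): Long = {
    val sizeOf: Iterator[T] => Long = { it =>
      var n = 0L
      while (it.hasNext) { it.next(); n += 1L }
      n
    }
    rdd.sparkContext.runJob(rdd, sizeOf).sum
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("count-sketch")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 4)
      // Both values are obtained by visiting every element of every partition.
      println(s"stock count = ${rdd.count()}, sketch = ${countByIterating(rdd)}")
    } finally sc.stop()
  }
}
```

Every row is produced by compute just to be discarded, which is the work a known row count would make unnecessary.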
As an initial attempt, I declared the following in my custom RDD
implementation:
override def count(): Long = { reader.RowCount }
but this never gets called, which upon further inspection makes perfect
sense: the internal code creates new RDDs for every partition it has to
work on. This is where I'm a bit stuck, because I have no idea how to
override this creation.
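The effect can be reproduced outside the ML code with a minimal sketch, assuming a local Spark 2.x setup. KnownSizeRDD, KnownSizePartition and the rowsRead accumulator are hypothetical stand-ins for my RDD and reader, not the actual implementation. The point it illustrates is that any transformation, such as map, returns a new RDD object whose count is the stock one, so the override on the source RDD is bypassed.

```scala
import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator

// Hypothetical partition that knows how many rows it holds.
class KnownSizePartition(val index: Int, val rows: Int) extends Partition

// Hypothetical stand-in for the custom RDD; rowsRead counts rows produced by compute.
class KnownSizeRDD(sc: SparkContext, numParts: Int, rowsPerPart: Int,
                   rowsRead: LongAccumulator) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](numParts)(i => new KnownSizePartition(i, rowsPerPart))

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator.range(0, split.asInstanceOf[KnownSizePartition].rows)
      .map { i => rowsRead.add(1L); i }

  // Shortcut: answers from metadata, no job is run.
  override def count(): Long = numParts.toLong * rowsPerPart
}

object OverrideSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val conf = new SparkConf().setMaster("local[2]").setAppName("override-sketch")
    val sc = new SparkContext(conf)
    try {
      val rowsRead = sc.longAccumulator("rowsRead")
      val base = new KnownSizeRDD(sc, 4, 1000, rowsRead)

      // Called directly on the custom RDD: the override answers, nothing is read.
      println(s"base.count = ${base.count()}, rows read = ${rowsRead.sum}")

      // A transformation wraps the custom RDD in a new RDD object.
      val derived = base.map(_ + 1)
      println(s"derived class = ${derived.getClass.getSimpleName}")

      // The new RDD uses the stock count, which pulls every row through compute.
      println(s"derived.count = ${derived.count()}, rows read = ${rowsRead.sum}")
    } finally sc.stop()
  }
}
```

In this sketch the accumulator stays at zero after the first count and only grows after the second one, which matches what I observe: count is invoked on an RDD that is not mine.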
Here is a call stack for a GBTRegressor run, but it's quite similar for
RandomForestRegressor or DecisionTreeRegressor.
org.apache.spark.rdd.RDD.count(RDD.scala:1158)
org.apache.spark.ml.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:116)
org.apache.spark.ml.tree.impl.RandomForest$.run(RandomForest.scala:105)
org.apache.spark.ml.regression.DecisionTreeRegressor.train(DecisionTreeRegressor.scala:125)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.boost(GradientBoostedTrees.scala:291)
org.apache.spark.ml.tree.impl.GradientBoostedTrees$.run(GradientBoostedTrees.scala:49)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:154)
org.apache.spark.ml.regression.GBTRegressor.train(GBTRegressor.scala:58)
org.apache.spark.ml.Predictor.fit(Predictor.scala:96)
Any suggestion would be much appreciated.
Regards