Posted to user@spark.apache.org by shivamverma <sh...@gmail.com> on 2015/07/13 11:45:01 UTC

Spark issue with running CrossValidator with RandomForestClassifier on dataset

Hi

I am running Spark 1.4 in Standalone mode on top of Hadoop 2.3 on a CentOS
node. I am trying to run grid search on an RF classifier to classify a small
dataset using the pyspark.ml.tuning module, specifically the
ParamGridBuilder and CrossValidator classes. I get the following error when
I try passing a DataFrame of Features-Labels to CrossValidator:



I tried the following code, using the dataset given in Spark's documentation
for CrossValidator
<https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator>.
I also pass the DF through a StringIndexer transformation for the RF:



Note that the above dataset works with logistic regression. I have also
tried a larger dataset with sparse vectors as features (which I was
originally trying to fit) but received the same error with RF.
My guess is that there is an issue with how
BinaryClassificationEvaluator(self, rawPredictionCol="rawPrediction",
labelCol="label", metricName="areaUnderROC") interprets the 'rawPrediction'
column: with LR, the rawPredictionCol is a list/vector, whereas with RF,
the prediction column is a double.
Is it an issue with the evaluator, or is there something else that I'm
missing?
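
To make the guess above concrete, here is a minimal plain-Python sketch (not
Spark's implementation) of why an area-under-ROC computation wants a vector
of per-class scores rather than a single predicted label. It assumes the
evaluator reads the class-1 entry of each vector. The helper names
positive_score and area_under_roc and the sample rows are made up for
illustration only.

```python
from typing import Sequence, Union

Raw = Union[float, Sequence[float]]


def positive_score(raw: Raw) -> float:
    """Return the class-1 score from a vector of per-class raw scores.

    Assumption: index 1 holds the score for the positive class.
    """
    if isinstance(raw, (int, float)):
        # A bare double (a hard predicted label) carries no per-class scores.
        raise TypeError(
            "expected a vector of per-class scores, got a scalar: %r" % (raw,)
        )
    return float(raw[1])


def area_under_roc(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly.

    Tied pairs count as half a correct ordering.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1.0]
    neg = [s for s, y in zip(scores, labels) if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical rows shaped like a vector-valued rawPrediction column (LR case).
lr_like = [[1.2, -1.2], [-0.3, 0.3], [-2.0, 2.0], [0.4, -0.4]]
labels = [0.0, 1.0, 1.0, 0.0]
auc = area_under_roc([positive_score(r) for r in lr_like], labels)

# Hypothetical rows shaped like a double-valued prediction column (RF case).
rf_like = [0.0, 1.0, 1.0, 0.0]
try:
    [positive_score(r) for r in rf_like]
    scalar_ok = True
except TypeError:
    scalar_ok = False
```

In this sketch the vector-shaped rows can be scored, while the double-shaped
rows are rejected before any metric is computed. That matches the shape of the
problem described above, though the actual Spark error may differ.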

Thanks!



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-issue-with-running-CrossValidator-with-RandomForestClassifier-on-dataset-tp23791.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Spark issue with running CrossValidator with RandomForestClassifier on dataset

Posted by Feynman Liang <fl...@databricks.com>.

Can you send the error messages again? I'm not seeing them.
