Posted to user@spark.apache.org by Aakash Basu <aa...@gmail.com> on 2018/01/19 13:42:49 UTC

Spark MLLib vs. SciKitLearn

Hi all,

I am totally new to ML APIs. I am trying to get the *ROC curve* for model
evaluation in both *scikit-learn* and *PySpark MLlib*, but I cannot find any
API for ROC curve calculation for binary classification in Spark MLlib.

The code below is a wrapper function that builds the respective dataframe
from the source data with two columns, as attached.

I want to achieve the same result as the Python code in Spark, i.e. get the
roc_curve output. Is there any API on the MLlib side to achieve the same?

Python sklearn Code -

# Assumes numpy is imported as np and sklearn.metrics as metrics.
def roc(self, y_true, y_pred):
    df_a = self._df.copy()
    # Drop NaNs from the true labels and cast to int. (Caution:
    # filtering NaNs per column independently can misalign the
    # label/prediction pairs; dropping whole rows is safer.)
    values_1_tmp = df_a[y_true].values
    values_1 = values_1_tmp[~np.isnan(values_1_tmp)].astype(int)
    values_2_tmp = df_a[y_pred].values
    values_2 = values_2_tmp[~np.isnan(values_2_tmp)].astype(int)
    # metrics.roc_curve returns (fpr, tpr, thresholds): the first
    # array is the false positive rate (i.e. 1 - specificity), the
    # second is the true positive rate (sensitivity).
    fpr, tpr, thresholds = metrics.roc_curve(values_1, values_2,
                                             pos_label=2)
    # area_under_roc = metrics.roc_auc_score(values_1, values_2)
    print(tpr, fpr)
    return tpr, fpr

Result:

[ 0.          0.34138342  0.67412045  1.        ] [ 0.          0.33373458  0.67378875  1.        ]
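
(Side note: if you actually want specificity rather than the false
positive rate, it is just 1 - fpr. A minimal standalone sketch with
made-up labels and scores, using pos_label=2 as above:)

import numpy as np
from sklearn import metrics

# Hypothetical toy data, only to illustrate the return order.
y_true = np.array([1, 1, 2, 2])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score, pos_label=2)
specificity = 1 - fpr  # derived from fpr; not returned directly
print(tpr, specificity)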


PySpark Code -

def roc(self, y_true, y_pred):
    # Assumes: from pyspark.mllib.evaluation import BinaryClassificationMetrics
    print('using pyspark df')
    df_a = self._df
    # BinaryClassificationMetrics expects an RDD of (score, label)
    # pairs, in that order, so put the prediction column first; there
    # is no need to round-trip through pandas and parallelize().
    score_and_labels = df_a[y_pred, y_true].rdd.map(
        lambda row: (float(row[0]), float(row[1])))
    metrics = BinaryClassificationMetrics(score_and_labels)
    # Only scalar summaries (areaUnderROC, areaUnderPR) are exposed
    # by the Python wrapper; the curve points themselves are not.
    roc_calc = metrics.areaUnderROC
    print(roc_calc)
    print(type(roc_calc))
    return roc_calc

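For anyone searching the archives: the Python wrapper around
BinaryClassificationMetrics does not expose the Scala-side roc()
method, but the curve points can still be pulled out by reaching
through to the underlying Java object. A sketch of that workaround
below - it relies on the private _java_model handle, so treat it as a
hack rather than a supported API:

from pyspark.mllib.evaluation import BinaryClassificationMetrics

class CurveMetrics(BinaryClassificationMetrics):
    """Expose the Scala-side roc() points through the Python wrapper."""

    def get_roc(self):
        # _java_model is the Scala BinaryClassificationMetrics object;
        # its roc() returns an RDD[(Double, Double)] of (FPR, TPR)
        # points. The Scala Tuple2 elements have no py4j mapping, so
        # unpack them via their _1()/_2() accessors.
        java_rdd = self._java_model.roc().toJavaRDD()
        return [(float(t._1()), float(t._2())) for t in java_rdd.collect()]

# Usage (score_and_labels as built in roc() above):
# points = CurveMetrics(score_and_labels).get_roc()

Alternatively, if the scores come from a pyspark.ml LogisticRegression
model, the training summary already exposes the curve (Spark 2.0+):
lr_model.summary.roc is a DataFrame with 'FPR' and 'TPR' columns.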

Please help.

Thanks,
Aakash.

Re: Spark MLLib vs. SciKitLearn

Posted by Aakash Basu <aa...@gmail.com>.
Any help on the above?
