Posted to issues@spark.apache.org by "Sean R. Owen (Jira)" <ji...@apache.org> on 2022/08/31 17:28:00 UTC

[jira] [Resolved] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value

     [ https://issues.apache.org/jira/browse/SPARK-40232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean R. Owen resolved SPARK-40232.
----------------------------------
    Resolution: Not A Problem

No, initSteps controls an aspect of the initialization; I don't think you want to change it. I would expect potentially different results with different seeds and initializations. Maybe not really different results, but I don't know whether your maxIter is high enough or whether the comparison to sklearn is apples to apples. Too many variables.
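
For context, a minimal sketch (parameter values are illustrative) of where these knobs sit in the PySpark API. Unlike sklearn's n_init, which restarts the whole algorithm n times and keeps the best run, each PySpark fit() call trains exactly one model, and initSteps only controls the k-means|| initialization pass:

{code:python}
from pyspark.ml.clustering import KMeans

# initMode='k-means||' is the default; initSteps (default 2) sets how many
# rounds the k-means|| initializer runs before the main k-means iterations.
kmeans = KMeans(featuresCol='features', k=5,
                initMode='k-means||', initSteps=2,
                maxIter=300, tol=1e-4, seed=42)
# One fit() call == one initialization + one k-means run == one model.
{code}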

> KMeans: high variability in results despite high initSteps parameter value
> --------------------------------------------------------------------------
>
>                 Key: SPARK-40232
>                 URL: https://issues.apache.org/jira/browse/SPARK-40232
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 3.3.0
>            Reporter: Patryk Piekarski
>            Priority: Major
>         Attachments: sample_data.csv
>
>
> I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I experimented with the _initSteps_ parameter. My understanding was that the higher the number of steps for the k-means|| initialization mode, the more initialization rounds the algorithm runs, selecting the best model at the end. That's the behavior I observe when running the sklearn implementation with _n_init_ >= 10. However, with the PySpark implementation, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with _initSteps_ set to 10, 50, or 500, the results I get with different seeds differ, and the trainingCost value I observe is sometimes far from the lowest.
> As a workaround, to force the algorithm to iterate and select the best model, I used a loop with a dynamic seed, as sketched below.
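> A minimal sketch of that workaround (variable names are illustrative):
> {code:python}
> from pyspark.ml.clustering import KMeans
>
> best_cost, best_model = float('inf'), None
> for seed in range(1, 11):
>     model = KMeans(featuresCol='features', k=5, seed=seed).fit(assembled_data)
>     cost = model.summary.trainingCost
>     if cost < best_cost:
>         best_cost, best_model = cost, model  # keep the lowest-cost run
> {code}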
> Sklearn reaches a trainingCost near 276655 in every iteration.
> The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.
> Does the _initSteps_ parameter work as expected? My findings suggest that something might be off here.
> Let me know where I could upload this sample dataset (2MB).
>  
> {code:python}
> import pandas as pd
> from sklearn.cluster import KMeans as KMeansSKlearn
>
> df = pd.read_csv('sample_data.csv')
>
> # scikit-learn: n_init=10 restarts k-means 10 times per call and keeps the best run
> for i in range(1, 10):
>     kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
>     model = kmeans.fit(df)
>     print(f'Sklearn iteration {i}: {round(model.inertia_)}')
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder \
>     .appName("kmeans-test") \
>     .config('spark.driver.memory', '2g') \
>     .master("local[2]") \
>     .getOrCreate()
> df1 = spark.createDataFrame(df)
>
> from pyspark.ml.clustering import KMeans
> from pyspark.ml.feature import VectorAssembler
>
> assemble = VectorAssembler(inputCols=df1.columns, outputCol='features')
> assembled_data = assemble.transform(df1)
>
> # PySpark: each fit() trains a single model; initSteps only tunes the k-means|| initializer
> for i in range(1, 10):
>     kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
>     model = kmeans.fit(assembled_data)
>     summary = model.summary
>     print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
> {code}
>  


