Posted to issues@spark.apache.org by "Patryk Piekarski (Jira)" <ji...@apache.org> on 2022/08/26 13:36:00 UTC

[jira] [Created] (SPARK-40232) KMeans: high variability in results despite high initSteps parameter value

Patryk Piekarski created SPARK-40232:
----------------------------------------

             Summary: KMeans: high variability in results despite high initSteps parameter value
                 Key: SPARK-40232
                 URL: https://issues.apache.org/jira/browse/SPARK-40232
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 3.3.0
            Reporter: Patryk Piekarski
         Attachments: sample_data.csv

I'm running KMeans on a sample dataset using PySpark. I want the results to be fairly stable, so I've been experimenting with the _initSteps_ parameter. My understanding is that the higher the number of steps for the k-means|| initialization mode, the more initialization iterations the algorithm runs, and in the end it selects the best model out of all of them. That is the behavior I observe when running the sklearn implementation with _n_init_ >= 10. With the PySpark implementation, however, regardless of the number of partitions of the underlying data frame (tested with 1, 4, and 8 partitions), and even with _initSteps_ set to 10, 50, or 500, the results I get with different seeds differ, and the trainingCost value I observe is sometimes far from the lowest.

As a workaround, to force the algorithm to try several initializations and select the best model, I used a loop with a varying seed (a sketch of this is included after the code below).

Sklearn gets a trainingCost near 276655 in every iteration.

The PySpark implementation of KMeans gets there in the 2nd, 5th, and 6th iterations, but all the remaining iterations yield higher values.

Does the _initSteps_ parameter work as expected? Because my findings suggest that something might be off here.

Let me know where I could upload this sample dataset (2 MB).

 
{code:python}
import pandas as pd
from sklearn.cluster import KMeans as KMeansSKlearn
df = pd.read_csv('sample_data.csv')

minimum = 99999999
for i in range(1,10):
    kmeans = KMeansSKlearn(init="k-means++", n_clusters=5, n_init=10, random_state=i)
    model = kmeans.fit(df)
    print(f'Sklearn iteration {i}: {round(model.inertia_)}')

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("kmeans-test") \
    .config('spark.driver.memory', '2g') \
    .master("local[2]") \
    .getOrCreate()

df1 = spark.createDataFrame(df)

from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler
assemble=VectorAssembler(inputCols=df1.columns, outputCol='features')
assembled_data=assemble.transform(df1)

minimum = 99999999
for i in range(1,10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100, maxIter=300, seed=i, tol=0.0001)
    model = kmeans.fit(assembled_data)
    summary = model.summary
    print(f'PySpark iteration {i}: {round(summary.trainingCost)}')
{code}
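For completeness, here is a minimal sketch of the workaround mentioned above: fitting KMeans with several different seeds and keeping the model with the lowest trainingCost. It reuses the _assembled_data_ frame from the snippet above; the seed range and variable names are just placeholders.

{code:python}
from pyspark.ml.clustering import KMeans

# Workaround sketch: fit KMeans with several seeds and keep the model
# whose trainingCost is lowest. 'assembled_data' is the vector-assembled
# data frame from the snippet above; the seed range is arbitrary.
best_model = None
best_cost = float('inf')
for seed in range(1, 10):
    kmeans = KMeans(featuresCol='features', k=5, initSteps=100,
                    maxIter=300, seed=seed, tol=0.0001)
    model = kmeans.fit(assembled_data)
    cost = model.summary.trainingCost
    if cost < best_cost:
        best_cost = cost
        best_model = model
print(f'Best trainingCost across seeds: {round(best_cost)}')
{code}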
 


