Posted to issues@spark.apache.org by "rajanimaski (JIRA)" <ji...@apache.org> on 2017/10/08 15:23:00 UTC

[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

    [ https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196165#comment-16196165 ] 

rajanimaski commented on SPARK-20696:
-------------------------------------

Spark k-means (the Scala MLlib API) consistently produces highly skewed cluster-size distributions in my experiments as well (see Figure 1): the majority of the data points are assigned to a single cluster. The experiment was conducted on the 20 Newsgroups data, for which ground truth is available: the ~10K data points were manually categorized into 20 fairly balanced groups. http://qwone.com/~jason/20Newsgroups/

Initially I suspected that the vector-creation step (using Spark's HashingTF and IDF libraries) was the cause of the incorrect clustering. However, even after implementing my own TF-IDF vectorization I got similar clustering results, with a highly skewed size distribution.
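
For concreteness, a minimal sketch of a hand-rolled TF-IDF along these lines (illustrative plain Python rather than Spark, and not the exact code used in the experiment):

import math
from collections import Counter

def tf_idf_vectors(docs):
    # docs: list of token lists; returns one {term: weight} dict per document
    n_docs = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n_docs / df[t])
                        for t in tf})
    return vectors

docs = [["spark", "kmeans", "text"], ["spark", "tfidf"], ["text", "text", "cluster"]]
print(tf_idf_vectors(docs)[0])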

Eventually I implemented my own version of k-means on top of Spark, using the standard TF-IDF vector representation and negative cosine similarity as the distance metric. The results from this k-means look right; see Figure 2 below.
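
The distance there is just the negated cosine similarity, so the assignment step picks the centroid with the highest cosine similarity to each document. A minimal numpy sketch of that assignment step (illustrative, not the exact implementation):

import numpy as np

def assign_clusters(X, centroids, eps=1e-12):
    # X: (n_docs, n_terms) tf-idf matrix; centroids: (k, n_terms)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    Cn = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + eps)
    neg_cos = -(Xn @ Cn.T)          # negative cosine similarity, shape (n_docs, k)
    return neg_cos.argmin(axis=1)   # minimizing it maximizes cosine similarity

X = np.array([[1.0, 0.0], [0.0, 1.0]])
C = np.array([[2.0, 0.1], [0.1, 2.0]])
print(assign_clusters(X, C))        # [0 1]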

Additionally, I experimented with plugging Euclidean distance into my own k-means as the distance metric, and the results still look right: not as skewed as Spark's k-means. Figures 1 and 2: !https://i.stack.imgur.com/CkcIZ.png!
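
Note that the two metrics are closely related on normalized data: for unit vectors x and c, ||x - c||^2 = 2 - 2*cos(x, c), so on L2-normalized tf-idf vectors Euclidean k-means behaves much like cosine-based k-means. A quick numeric check of that identity:

import numpy as np

rng = np.random.default_rng(0)
x, c = rng.random(5), rng.random(5)
x /= np.linalg.norm(x)              # make both vectors unit length
c /= np.linalg.norm(c)
lhs = np.sum((x - c) ** 2)          # squared Euclidean distance
rhs = 2 - 2 * np.dot(x, c)          # 2 - 2 * cosine similarity
print(np.isclose(lhs, rhs))        # True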



> tf-idf document clustering with K-means in Apache Spark putting points into one cluster
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-20696
>                 URL: https://issues.apache.org/jira/browse/SPARK-20696
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Nassir
>
> I am trying to do the classic job of clustering text documents: pre-processing, generating a tf-idf matrix, and then applying K-means. However, testing this workflow on the classic 20NewsGroup dataset results in most documents being clustered into one cluster. (I initially tried to cluster all documents from 6 of the 20 groups, so I expected 6 clusters.)
> I am implementing this in Apache Spark, as my purpose is to use this technique on millions of documents. Here is the code, written in PySpark on Databricks:
> # declare the path to the folder containing 6 of the 20 newsgroup categories
> path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % MOUNT_NAME
> # read all the text files from the 6 folders; each record is an entire document
> text_files = sc.wholeTextFiles(path).cache()
> # convert the RDD to a DataFrame
> df = text_files.toDF(["filePath", "document"]).cache()
> from pyspark.ml.feature import HashingTF, IDF, Tokenizer
> # tokenize the document text
> tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
> tokenized = tokenizer.transform(df).cache()
> # remove stop words
> from pyspark.ml.feature import StopWordsRemover
> remover = StopWordsRemover(inputCol="tokens", outputCol="stopWordsRemovedTokens")
> stopWordsRemoved_df = remover.transform(tokenized).cache()
> # hash the tokens into term-frequency vectors
> hashingTF = HashingTF(inputCol="stopWordsRemovedTokens", outputCol="rawFeatures", numFeatures=200000)
> tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
> # weight the term frequencies by inverse document frequency
> idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
> idfModel = idf.fit(tfVectors)
> tfIdfVectors = idfModel.transform(tfVectors).cache()
> # note that I have also tried to use normalized data, but get the same result
> from pyspark.ml.feature import Normalizer
> normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
> l2NormData = normalizer.transform(tfIdfVectors)
> from pyspark.ml.clustering import KMeans
> # Train a KMeans model on the normalized features. Note that featuresCol must
> # point at "normFeatures"; with the default ("features") the normalization
> # above is silently ignored.
> kmeans = KMeans(featuresCol="normFeatures", k=6, maxIter=20)
> km_model = kmeans.fit(l2NormData)
> clustersTable = km_model.transform(l2NormData)
> Output showing that most documents end up in cluster 0:
> ID   number_of_documents_in_cluster
> 0    3024
> 3    5
> 1    3
> 5    2
> 2    2
> 4    1
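> # (Illustrative addition, assuming the pipeline above: a table like this can
> # be produced with the DataFrame API; "prediction" is the default output
> # column of KMeansModel.transform.)
> clustersTable.groupBy("prediction").count().orderBy("prediction").show()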
> As you can see, most of my data points end up in cluster 0, and I cannot figure out what I am doing wrong, since all the tutorials and code I have come across online point to this method.
> In addition, I have also tried normalizing the tf-idf matrix before K-means, but that produces the same result. I know cosine distance is a better measure to use, but I expected standard K-means in Apache Spark to provide meaningful results.
> Can anyone help with whether I have a bug in my code, or whether something is missing in my clustering pipeline?
> (Question also asked on Stack Overflow: http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)
> Thank you in advance!


