Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2018/08/24 09:55:00 UTC

[jira] [Commented] (SPARK-25219) KMeans Clustering - Text Data - Results are incorrect

    [ https://issues.apache.org/jira/browse/SPARK-25219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591423#comment-16591423 ] 

Marco Gaido commented on SPARK-25219:
-------------------------------------

Hi [~VVasanth], a JIRA like this is very difficult to work on: a report that something returns an unexpected result, with no further detail, is not enough of a starting point to act on.

It would be great if you could provide a simple reproducer. If possible, the reproducer should involve only one component (in this case KMeans, without any other transformations), together with the set of parameters that reproduces the problem and the expected result, i.e. the one returned by the other libraries with the same parameters.
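
As a rough illustration of the shape such a reproducer could take, here is a minimal sketch. Everything in it is an assumption made for illustration: the toy data in POINTS, the values of K and SEED, and the helper names are hypothetical and are not taken from the reporter's job. It feeds the same fixed vectors to Spark ML KMeans and to scikit-learn KMeans, with no other transformers involved, and compares the resulting partitions. Cluster ids are arbitrary in both libraries, so the comparison ignores label names.

```python
from collections import Counter

# Hypothetical toy input: two well-separated groups of 2-D points.
POINTS = [
    (0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
    (9.0, 9.0), (9.1, 9.1), (9.2, 9.0),
]
K = 2      # assumed number of clusters
SEED = 1   # assumed seed, passed to both libraries


def spark_labels(points, k, seed):
    """Cluster `points` with Spark ML KMeans only (no other transformers)."""
    # Imported lazily so the comparison helpers work without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    spark = (SparkSession.builder.master("local[1]")
             .appName("kmeans-repro").getOrCreate())
    try:
        rows = [(i, Vectors.dense(list(p))) for i, p in enumerate(points)]
        df = spark.createDataFrame(rows, ["id", "features"])
        model = KMeans(k=k, seed=seed).fit(df)
        # Sort by id so the labels line up with the input order.
        out = model.transform(df).orderBy("id").collect()
        return [int(r["prediction"]) for r in out]
    finally:
        spark.stop()


def sklearn_labels(points, k, seed):
    """Cluster the same points with scikit-learn KMeans for reference."""
    from sklearn.cluster import KMeans

    model = KMeans(n_clusters=k, random_state=seed, n_init=10)
    return [int(x) for x in model.fit_predict([list(p) for p in points])]


def cluster_sizes(labels):
    """Cluster sizes, largest first; highlights one oversized cluster."""
    return sorted(Counter(labels).values(), reverse=True)


def same_partition(a, b):
    """True if two labelings group the points identically, ignoring ids."""
    if len(a) != len(b):
        return False
    fwd, bwd = {}, {}
    for x, y in zip(a, b):
        if fwd.setdefault(x, y) != y or bwd.setdefault(y, x) != x:
            return False
    return True
```

With both libraries installed, calling same_partition(spark_labels(POINTS, K, SEED), sklearn_labels(POINTS, K, SEED)) and printing cluster_sizes for each labeling would give a concrete "actual vs. expected" pair to attach to the JIRA.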

Once the problem is clearer, I am happy to work on it, but first we need to understand whether this is indeed an issue and how to reproduce it. Thanks.

> KMeans Clustering - Text Data - Results are incorrect
> -----------------------------------------------------
>
>                 Key: SPARK-25219
>                 URL: https://issues.apache.org/jira/browse/SPARK-25219
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Vasanthkumar Velayudham
>            Priority: Major
>
> Hello Everyone,
> I am facing issues with KMeans clustering on my text data. When I cluster the data after applying transformations such as RegexTokenizer, stop word removal, HashingTF, and IDF, the generated clusters are not proper: one cluster has a large share of the data points assigned to it.
> I am able to cluster the same data, with similar processing and the same attributes, using the SKLearn KMeans algorithm.
> Searching the internet, I see that many others have reported the same issue with Spark's KMeans clustering library.
> I request your help in fixing this issue.
> Please let me know if you require any additional details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
