Posted to user@spark.apache.org by Reth RM <re...@gmail.com> on 2017/03/25 01:37:47 UTC

KMeans clustering resulting in skewed clusters

Hi,

  I am using Spark k-means for clustering records that consist of news
documents; the vectors are created by applying TF-IDF. The dataset that I am
using for testing right now is the ground-truth classified
http://qwone.com/~jason/20Newsgroups/

The issue is that all the documents are getting assigned to the same
cluster, while the other clusters each have just one vector (doc) picked as
the cluster center (skewed clustering). What could be the possible reasons
for this issue, and do you have any suggestions? Should I be retuning the
epsilon?
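For anyone reproducing this, here is a minimal, self-contained sketch of the
setup being described, in plain Python with toy documents standing in for
the 20 Newsgroups data (this is not the poster's actual Spark code; the
corpus, `tfidf`, and `kmeans` helpers are all illustrative):

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Build dense TF-IDF vectors over the corpus vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d.split())
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vectors

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means with random initial centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# three tiny "topics" (sports / technology / biology), three docs each
docs = [
    "game score team win", "team player score goal", "match win player game",
    "cpu gpu software code", "code compiler software bug", "gpu cpu chip code",
    "cell gene protein dna", "dna gene biology cell", "protein cell biology dna",
]
labels = kmeans(tfidf(docs), k=3)
print(labels)
```

On a vocabulary this small the clusters separate; the skew described above
shows up as the vocabulary grows and the vectors become sparse.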

Re: KMeans clustering resulting in skewed clusters

Posted by Asher Krim <ak...@hubspot.com>.
As I said in my previous reply, I don't think k-means is the right tool to
start with. Try LDA with k (the number of latent topics) set to 3 and go up
to, say, 20. The problem likely lies in the feature vectors, about which you
provided almost no information. Text is not drawn from a continuous space,
so any bag-of-words approach to clustering will likely fail unless you
first convert the features to a smaller and denser space.
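The sparsity point is easy to demonstrate with a small standard-library
sketch (random synthetic documents, not the poster's data; the vocabulary
and document sizes are made-up parameters): sparse bag-of-words vectors
over a large vocabulary are nearly orthogonal, so their pairwise cosine
similarity is close to zero and distance-based clustering has little to
work with.

```python
import math
import random

random.seed(42)
VOCAB, WORDS_PER_DOC, DOCS = 10_000, 50, 200

def random_doc():
    # a "one-hot bag of words": a small random subset of a large vocabulary
    return set(random.sample(range(VOCAB), WORDS_PER_DOC))

def cosine(a, b):
    # for binary vectors: dot product = overlap size, norm = sqrt(set size)
    return len(a & b) / math.sqrt(len(a) * len(b))

docs = [random_doc() for _ in range(DOCS)]
pairs = [(i, j) for i in range(DOCS) for j in range(i + 1, DOCS)]
avg = sum(cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)
print(f"average pairwise cosine similarity: {avg:.4f}")
```

The expected overlap between two such documents is only a fraction of a
word, which is why mapping into a smaller, denser space first helps.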

Asher Krim
Senior Software Engineer

On Wed, Mar 29, 2017 at 5:49 PM, Reth RM <re...@gmail.com> wrote:

> Hi Krim,
>
>   The dataset that I am experimenting with is ground-truth and it has 3
> types of docs: one with terms relevant to topic1 (sports), another with
> topic2 (technology), and a third, topic3, with biology. So k is set to 3,
> and the features are distinct in each topic (about 1230 features in
> total). I think the issue is with centroid convergence. I have been
> testing with different iteration counts, assuming that with a higher
> iteration count the centroids would converge at one point, stop shifting
> after that, and 'computeCost' would remain close to the same. However,
> when I test with incremental iteration counts and obtain the 'cost' at
> each iteration (or in windows of 5 iterations each), the cost keeps
> shifting invariably. The table below shows iteration count vs. cost. I
> also passed different epsilon values, thinking that might lead to
> consistent convergence, but no luck. Screenshot [1]
> <https://s04.justpaste.it/files/justpaste/d417/a15312908/screen_shot_2017-03-29_at_2_46_42_pm.png>
> shows the different iteration counts and epsilon vs. cost.
>
>
> Any thoughts on what I am doing wrong here?
>
>
> iterations  cost
> 3   1.841406859
> 4   1.750348983
> 5   1.514564993
> 6   1.514564993
> 7   1.514564993
> 8   1.514564993
> 9   1.514564993
> 10  1.514564993
> 11  1.514564993
> 12  1.514564993
> 13  1.750348983
> 14  1.750348983
> 15  1.514564993
> 16  1.514564993
> 17  1.514564993
> 18  1.514564993
> 19  1.514564993
> 20  1.750348983
>
> [1] https://s04.justpaste.it/files/justpaste/d417/a15312908/screen_shot_2017-03-29_at_2_46_42_pm.png
>
>
>
>
> On Sun, Mar 26, 2017 at 4:46 AM, Asher Krim <ak...@hubspot.com> wrote:
>
>> Hi,
>>
>> Do you mean that you're running k-means directly on TF-IDF bag-of-words
>> vectors? I think your results are expected because of the general lack
>> of overlap between one-hot encoded vectors. The similarity between most
>> vectors is expected to be very close to zero. Those that do end up in
>> the same cluster likely have a lot of similar boilerplate text (assuming
>> the training data comes from crawled news articles, they likely have
>> similar menus and header/footer text).
>>
>> I would suggest you try some dimensionality reduction on the tf-idf
>> vectors first. You have many options to choose from (LSA, LDA,
>> doc2vec, etc.). Other than that, this isn't a Spark question.
>>
>> Asher Krim
>> Senior Software Engineer
>>
>> On Fri, Mar 24, 2017 at 9:37 PM, Reth RM <re...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>>   I am using Spark k-means for clustering records that consist of news
>>> documents; the vectors are created by applying TF-IDF. The dataset that
>>> I am using for testing right now is the ground-truth classified
>>> http://qwone.com/~jason/20Newsgroups/
>>>
>>> The issue is that all the documents are getting assigned to the same
>>> cluster, while the other clusters each have just one vector (doc)
>>> picked as the cluster center (skewed clustering). What could be the
>>> possible reasons for this issue, and do you have any suggestions?
>>> Should I be retuning the epsilon?
>>>
>>>
>>>
>>>
>>>
>>
>>
>
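One plausible reading of the fluctuating costs in the quoted table: unless a
seed is fixed, each run of k-means starts from different initial centers and
can settle in a different local minimum, so cost is not monotone in the
iteration count across separate runs. A small standard-library sketch of
that effect (toy 2-D data and plain Lloyd's algorithm, not Spark's
k-means|| initialization):

```python
import random

def kmeans_cost(points, k, iters, seed):
    """Run Lloyd's k-means and return the final sum of squared distances."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            c = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[c].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

# three well-separated blobs, analogous to the 3-topic corpus
rng = random.Random(0)
points = [
    (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
    for cx, cy in [(0, 0), (5, 0), (0, 5)]
    for _ in range(30)
]
costs = [round(kmeans_cost(points, k=3, iters=20, seed=s), 6) for s in range(8)]
print(costs)  # different seeds can land in different local minima
```

If the costs only bounce between a few values as the seed varies, that is
the local-minimum signature; fixing the seed makes runs comparable across
iteration counts.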
