Posted to user@spark.apache.org by amin mohebbi <am...@yahoo.com.INVALID> on 2014/11/19 07:21:17 UTC

k-means clustering

Hi there,
I would like to do "text clustering" using k-means and Spark on a massive dataset. As you know, before running k-means I have to apply pre-processing steps such as TF-IDF and NLTK-based cleaning to my big dataset. The following is my code in Python:

if __name__ == '__main__':
    # Cluster a bunch of text documents.
    import csv
    import re
    import string
    import sys
    from collections import defaultdict

    import nltk
    from nltk.stem.porter import PorterStemmer

    k = 6
    vocab = {}
    xs = []
    ns = []
    cat = []
    filename = '2013-01.csv'
    with open(filename, newline='') as f:
        try:
            newsreader = csv.reader(f)
            for row in newsreader:
                ns.append(row[3])
                cat.append(row[4])
        except csv.Error as e:
            sys.exit('file %s, line %d: %s' % (filename, newsreader.line_num, e))

    # regex to remove special characters
    remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
    remove_num = re.compile(r'\d+')
    # nltk.download()
    stop_words = nltk.corpus.stopwords.words('english')
    stemmer = PorterStemmer()

    for a in ns:
        x = defaultdict(float)

        a1 = a.strip().lower()
        a2 = remove_spl_char_regex.sub(" ", a1)  # Remove special characters
        a3 = remove_num.sub("", a2)              # Remove numbers
        # Remove stop words
        words = a3.split()
        filter_stop_words = [w for w in words if w not in stop_words]
        stemed = [stemmer.stem(w) for w in filter_stop_words]
        ws = sorted(stemed)

        # ws = re.findall(r"\w+", a1)
        for w in ws:
            vocab.setdefault(w, len(vocab))  # assign each new term an index
            x[vocab[w]] += 1                 # term frequency keyed by vocab index
        xs.append(x.items())



Can anyone explain to me how I can do this pre-processing step before running k-means on Spark?
 
Best Regards

.......................................................

Amin Mohebbi

PhD candidate in Software Engineering 
 at university of Malaysia  

Tel : +60 18 2040 017



E-Mail : TP025921@ex.apiit.edu.my

              amin_524@me.com

Re: k-means clustering

Posted by Yanbo Liang <ya...@gmail.com>.
Pre-processing is a major part of the workload before training a model.
MLlib provides TF-IDF calculation, StandardScaler, and Normalizer, which are
essential for preprocessing and will be a great help for model training.

Take a look at this
http://spark.apache.org/docs/latest/mllib-feature-extraction.html
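
For example, a minimal PySpark sketch along the lines of that page could look like this (untested; it assumes the documents are already tokenised, and the input path is only a placeholder):

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import HashingTF, IDF, Normalizer

sc = SparkContext(appName="text-kmeans")

# One document per line, already cleaned and tokenised (e.g. by the NLTK
# pipeline from the original post); the path is a placeholder.
docs = sc.textFile("tokenized-docs.txt").map(lambda line: line.split())

tf = HashingTF().transform(docs)          # fixed-size term-frequency vectors
tf.cache()
tfidf = IDF().fit(tf).transform(tf)       # re-weight by inverse document frequency
features = Normalizer().transform(tfidf)  # L2-normalise each vector

model = KMeans.train(features, k=6, maxIterations=10)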


Re: k-means clustering

Posted by Jun Yang <ya...@gmail.com>.
Guys,

As to the question of pre-processing, you can simply migrate your existing
logic to Spark before running k-means.

I have only used Scala on Spark, not the Python binding, but I think the
basic steps should be the same.

BTW, if your data set is large with very high-dimensional sparse feature
vectors, k-means may not work as well as you expect. I think this is still
an area where Spark MLlib can be optimized.
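
In PySpark the migration could be roughly sketched as follows (untested; clean_tokens stands in for the NLTK cleaning from the original post, and nltk needs to be installed on the worker nodes so the closure can be deserialised there). The resulting RDD of token lists can then be fed to HashingTF/IDF as in the earlier reply.

import re
import string

import nltk
from nltk.stem.porter import PorterStemmer
from pyspark import SparkContext

# Built on the driver and shipped to the workers with the closure.
stop_words = set(nltk.corpus.stopwords.words('english'))
punct = re.compile('[%s]' % re.escape(string.punctuation))
nums = re.compile(r'\d+')
stemmer = PorterStemmer()

def clean_tokens(doc):
    # Same steps as the original loop: lower-case, strip punctuation and
    # numbers, drop stop words, stem.
    a = nums.sub('', punct.sub(' ', doc.strip().lower()))
    return [stemmer.stem(w) for w in a.split() if w not in stop_words]

sc = SparkContext(appName="preprocess")
# Naive one-record-per-line CSV handling, just for illustration.
docs = sc.textFile('2013-01.csv').map(lambda line: line.split(',')[3])
tokens = docs.map(clean_tokens)   # the cleaning now runs in parallel on the cluster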




-- 
yangjunpro@gmail.com
http://hi.baidu.com/yjpro