Posted to user@spark.apache.org by durin <ma...@simon-schaefer.net> on 2014/07/24 23:30:24 UTC

KMeans: expensiveness of large vectors

As a source, I have a text file with n rows that each contain m
comma-separated integers.
Each row is then converted into a feature vector with m features.

I've noticed that, given the same total file size and number of features, a
larger number of columns is much more expensive for training a KMeans model
than a larger number of rows.

To give an example:
10k rows x 1k columns took 21 seconds on my cluster, whereas 1k rows x 10k
columns took 1 min 47 s. Both files had a size of 238M.

Can someone explain what in the implementation of KMeans causes large
vectors to be so much more expensive than having many of these vectors?
A pointer to the exact part of the source would be fantastic, but even a
general explanation would help me.
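
For reference, my setup is roughly the following (a simplified spark-shell
sketch, where sc is the SparkContext; the path, k, and the iteration count
are placeholders, not my actual values):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // each line holds m comma-separated integers -> one dense feature vector
    val data = sc.textFile("hdfs:///path/to/input.csv")
      .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
      .cache()

    // k and maxIterations are placeholder values
    val model = KMeans.train(data, 10, 20)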


Best regards,
Simon 




Re: KMeans: expensiveness of large vectors

Posted by Xiangrui Meng <me...@gmail.com>.
Before torrent, HTTP was the default way of broadcasting. The driver
holds the data and the executors request it via HTTP, making the
driver the bottleneck if the data is large. -Xiangrui
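
For reference, the broadcast implementation is selectable in Spark 1.x via
the spark.broadcast.factory setting (a minimal sketch, using the class names
from the 1.x configuration docs; check your version for the default):

    // choose the broadcast implementation when building the SparkConf
    val conf = new org.apache.spark.SparkConf()
      // torrent-based broadcast:
      .set("spark.broadcast.factory",
           "org.apache.spark.broadcast.TorrentBroadcastFactory")
      // HTTP-based broadcast (the older behavior described above):
      // .set("spark.broadcast.factory",
      //      "org.apache.spark.broadcast.HttpBroadcastFactory")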

On Tue, Jul 29, 2014 at 10:32 AM, durin <ma...@simon-schaefer.net> wrote:
> Development is really rapid here, that's a great thing.
>
> Out of curiosity, how did communication work before torrent? Did everything
> have to go back to the master / driver first?
>
>
>

Re: KMeans: expensiveness of large vectors

Posted by durin <ma...@simon-schaefer.net>.
Development is really rapid here, that's a great thing.

Out of curiosity, how did communication work before torrent? Did everything
have to go back to the master / driver first?




Re: KMeans: expensiveness of large vectors

Posted by Xiangrui Meng <me...@gmail.com>.
Great! Thanks for testing the new features! -Xiangrui

On Mon, Jul 28, 2014 at 8:58 PM, durin <ma...@simon-schaefer.net> wrote:
> Hi Xiangrui,
>
> using the current master meant a huge improvement for my task. Something
> that did not even finish before (training with 120G of dense data) now
> completes in a reasonable time. I guess using torrent helps a lot in this
> case.
>
>
> Best regards,
> Simon
>
>
>

Re: KMeans: expensiveness of large vectors

Posted by durin <ma...@simon-schaefer.net>.
Hi Xiangrui,

using the current master meant a huge improvement for my task. Something
that did not even finish before (training with 120G of dense data) now
completes in a reasonable time. I guess using torrent helps a lot in this
case.


Best regards,
Simon




Re: KMeans: expensiveness of large vectors

Posted by Xiangrui Meng <me...@gmail.com>.
1. I meant that in the n (1k) by m (10k) case, we need to broadcast k
centers, and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to the workers, which
should be much faster.

2. For MLlib algorithms, the number of partitions shouldn't be much
larger than the number of CPU cores. Your setting looks good.

3. You can use the hashing trick to limit the number of features, or
remove low-frequency and high-frequency words from the dictionary.
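
A minimal sketch of the hashing trick mentioned in 3. (a hand-rolled
illustration; the helper name hashingFeatures and the 2^15 bucket count are
arbitrary choices, not MLlib API):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // hash each word into one of numFeatures buckets, so the feature
    // dimension stays fixed no matter how large the vocabulary grows
    def hashingFeatures(words: Seq[String], numFeatures: Int = 1 << 15): Vector = {
      val counts = scala.collection.mutable.Map.empty[Int, Double]
      words.foreach { w =>
        val idx = ((w.hashCode % numFeatures) + numFeatures) % numFeatures
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
      Vectors.sparse(numFeatures, counts.toSeq)
    }

Collisions merge a few words into the same column, but the broadcast centers
then have a bounded, moderate dimension.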

Best,
Xiangrui

On Mon, Jul 28, 2014 at 12:55 PM, durin <ma...@simon-schaefer.net> wrote:
> Hi Xiangrui,
>
> thanks for the explanation.
>
> 1. You said we have to broadcast m * k centers (with m = number of rows). I
> thought there were only k centers at any given time, which would then have a
> size of n * k and would need to be broadcast. Is that a typo, or did I
> misunderstand something?
> And the collection of the averages is partition-wise, so more partitions =
> more overhead, but basically the same number of operations?
>
> 2. I have 5 executors with 8 CPU cores and 25G of memory each, and I usually
> split the input RDD into 80 partitions for a few Gigs of input data. Is
> there a rule of thumb for the number of partitions in relation to the input
> size?
>
>
> 3. Assume I didn't use numeric data but instead converted text data into
> a numeric representation using a dictionary and a featurization function:
> the number of columns would then be the number of entries in my dictionary
> (i.e. the number of distinct words in my case). I'd use a sparse vector
> representation, of course. But even so, if I have a few hundred thousand
> entries and therefore columns, the broadcasting overhead will get very
> large, as the centers are still in a dense representation.
> Do you know of any way to improve performance then?
>
>
> Best regards,
> Simon
>
>
>

Re: KMeans: expensiveness of large vectors

Posted by durin <ma...@simon-schaefer.net>.
Hi Xiangrui,

thanks for the explanation.

1. You said we have to broadcast m * k centers (with m = number of rows). I
thought there were only k centers at any given time, which would then have a
size of n * k and would need to be broadcast. Is that a typo, or did I
misunderstand something?
And the collection of the averages is partition-wise, so more partitions =
more overhead, but basically the same number of operations?

2. I have 5 executors with 8 CPU cores and 25G of memory each, and I usually
split the input RDD into 80 partitions for a few Gigs of input data. Is
there a rule of thumb for the number of partitions in relation to the input
size?


3. Assume I didn't use numeric data but instead converted text data into
a numeric representation using a dictionary and a featurization function:
the number of columns would then be the number of entries in my dictionary
(i.e. the number of distinct words in my case). I'd use a sparse vector
representation, of course. But even so, if I have a few hundred thousand
entries and therefore columns, the broadcasting overhead will get very
large, as the centers are still in a dense representation.
Do you know of any way to improve performance then?
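
For concreteness, the featurization I have in mind looks roughly like this
(a simplified sketch; tokenization and building the dictionary are omitted):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // dictionary maps each distinct word to a column index in 0 until dictionary.size
    def featurize(words: Seq[String], dictionary: Map[String, Int]): Vector = {
      val counts = words
        .flatMap(dictionary.get)                  // ignore out-of-dictionary words
        .groupBy(identity)
        .map { case (idx, hits) => (idx, hits.size.toDouble) }
        .toSeq
      Vectors.sparse(dictionary.size, counts)     // rows are sparse, but the centers stay dense
    }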


Best regards,
Simon




Re: KMeans: expensiveness of large vectors

Posted by Xiangrui Meng <me...@gmail.com>.
If you have an m-by-n dataset and train a k-means model with k clusters,
the cost of each iteration is

O(m * n * k) (assuming dense data)

Since m * n * k == n * m * k, you would ideally expect the same run
time. However:

1. Communication. We need to broadcast the current centers (m * k), do the
computation, and then collect the averaged centers from each partition
(m * k). Having a large n (#cols) will increase the communication cost
(a sketch of this pattern follows the list below).

2. Load. How many partitions did you use? With 10k rows, the rows are
probably distributed evenly. But with only 1000 rows you may have, for
example, 400 rows on a single partition and 200 rows on each of the
other three; the run time is then determined by the largest partition
(a quick way to check this appears further below).
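
To make 1. concrete, here is a rough sketch of that broadcast-and-collect
pattern (an illustration only, not the actual MLlib code; kmeansStep and the
plain Array[Double] representation are simplifications):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // one k-means iteration, reduced to its communication pattern:
    // broadcast the current centers, compute per-partition partial sums,
    // then combine them on the driver into the new centers
    def kmeansStep(sc: SparkContext,
                   points: RDD[Array[Double]],
                   centers: Array[Array[Double]]): Array[Array[Double]] = {
      val bcCenters = sc.broadcast(centers)

      val (sums, counts) = points.mapPartitions { iter =>
        val cs = bcCenters.value
        val partSums = Array.fill(cs.length)(Array.fill(cs(0).length)(0.0))
        val partCounts = Array.fill(cs.length)(0L)
        iter.foreach { p =>
          // index of the closest center (squared Euclidean distance)
          val best = cs.indices.minBy { i =>
            cs(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum
          }
          for (j <- p.indices) partSums(best)(j) += p(j)
          partCounts(best) += 1
        }
        Iterator((partSums, partCounts))
      }.reduce { case ((s1, c1), (s2, c2)) =>
        // merge the per-partition partial results
        val s = s1.zip(s2).map { case (a, b) => a.zip(b).map { case (x, y) => x + y } }
        val c = c1.zip(c2).map { case (x, y) => x + y }
        (s, c)
      }

      // new centers = per-cluster means (keep the old center if a cluster is empty)
      sums.zip(counts).zip(centers).map { case ((s, c), old) =>
        if (c == 0L) old else s.map(_ / c)
      }
    }

Both the broadcast and the per-partition results sent back are on the order
of k times the vector dimension, which is why very wide rows make each
iteration's communication expensive.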

Hopefully these two points explain your observation.
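
And a quick way to check for the skew described in 2. (a sketch; data here
stands for the RDD of rows being clustered):

    // number of rows in each partition -- a quick way to spot skew
    val partitionSizes = data.mapPartitions(it => Iterator(it.size)).collect()
    // rebalance if a few partitions dominate (this shuffles the data):
    // val balanced = data.repartition(sc.defaultParallelism)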

Best,
Xiangrui

On Thu, Jul 24, 2014 at 2:30 PM, durin <ma...@simon-schaefer.net> wrote:
> As a source, I have a text file with n rows that each contain m
> comma-separated integers.
> Each row is then converted into a feature vector with m features.
>
> I've noticed that, given the same total file size and number of features, a
> larger number of columns is much more expensive for training a KMeans model
> than a larger number of rows.
>
> To give an example:
> 10k rows x 1k columns took 21 seconds on my cluster, whereas 1k rows x 10k
> columns took 1 min 47 s. Both files had a size of 238M.
>
> Can someone explain what in the implementation of KMeans causes large
> vectors to be so much more expensive than having many of these vectors?
> A pointer to the exact part of the source would be fantastic, but even a
> general explanation would help me.
>
>
> Best regards,
> Simon
>
>
>